Hi everyone,
at first, thanks for building this great framework! We are using Flink and especially Gelly for building a graph analytics stack (gradoop.com). I was wondering if there is a [planned] support for a containment join operator. Consider the following example: DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]} DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>} What I want to compute is left.join(right).where(list).contains(tuple.f0) := { <[0, 1], <0,1>>, <[0, 1], <1, 0>>, <[2, 3, 4], <2, 1>>, <[5], <5, 2> } At the moment, I am solving that using cross and filter, which can be expensive. The generalization of that operator would be "set containment join", where you join if the right set is contained in the left set. If there is a general need for that operator, I would also like to contribute to its implementation. But maybe, there is already another nice solution which I didn't discover yet? Any help would be appreciated. Especially since I would also like to contribute some of our graph operators (e.g., graph summarization) back to Flink/Gelly (current WIP state can be found here: [1]). Thanks, Martin [1] https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java |
Hi Martin, good to hear that you like Flink :-)Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation). - Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API. - I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.? Contributions to Gelly (and Flink in general) are highly welcome. Best, Fabian 2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>: Hi everyone, |
I would rewrite this to replicate the list into tuples: "foreach x in list: emit (x, list)" Then join on fields 0. This replicates the lists, but makes the join very efficient. On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <[hidden email]> wrote:
|
Hi Fabian, hi Stephen,
thanks for answering my question. Good hint with the list replication, I will benchmark this vs. cross + filter. Best, Martin Am 17.07.2015 um 11:17 schrieb Stephan
Ewen:
|
Hi Martin, I'm really glad to see that you've started using Gelly :) I think that a graph summarization library method would be a great addition! Let me know if you need help and if you want to discuss ideas or other methods. Cheers, Vasia. On 17 July 2015 at 12:25, Martin Junghanns <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |