Containment Join Support

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Containment Join Support

Martin Junghanns
Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin


[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java


Reply | Threaded
Open this post in threaded view
|

Re: Containment Join Support

Fabian Hueske-2
Hi Martin,

good to hear that you like Flink :-)
AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.
Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian


2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:
Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin


[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java



Reply | Threaded
Open this post in threaded view
|

Re: Containment Join Support

Stephan Ewen
I would rewrite this to replicate the list into tuples: 

"foreach x in list: emit (x, list)"
Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <[hidden email]> wrote:
Hi Martin,

good to hear that you like Flink :-)
AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.
Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian


2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:
Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin


[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java




Reply | Threaded
Open this post in threaded view
|

Re: Containment Join Support

Martin Junghanns
Hi Fabian, hi Stephen,

thanks for answering my question. Good hint with the list replication, I will benchmark this vs. cross + filter.

Best, Martin

Am 17.07.2015 um 11:17 schrieb Stephan Ewen:
I would rewrite this to replicate the list into tuples: 

"foreach x in list: emit (x, list)"
Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <[hidden email]> wrote:
Hi Martin,

good to hear that you like Flink :-)
AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.
Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian


2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:
Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin


[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java





Reply | Threaded
Open this post in threaded view
|

Re: Containment Join Support

Vasiliki Kalavri
Hi Martin,

I'm really glad to see that you've started using Gelly :)

I think that a graph summarization library method would be a great addition! 
Let me know if you need help and if you want to discuss ideas or other methods.

Cheers,
Vasia.

On 17 July 2015 at 12:25, Martin Junghanns <[hidden email]> wrote:
Hi Fabian, hi Stephen,

thanks for answering my question. Good hint with the list replication, I will benchmark this vs. cross + filter.

Best, Martin


Am 17.07.2015 um 11:17 schrieb Stephan Ewen:
I would rewrite this to replicate the list into tuples: 

"foreach x in list: emit (x, list)"
Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <[hidden email]> wrote:
Hi Martin,

good to hear that you like Flink :-)
AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.
Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian


2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:
Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin


[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java