(DEPRECATED) Apache Flink User Mailing List archive.

Containment Join Support

Classic

List

Threaded

5 messages Options

Martin Junghanns

Containment Join Support

Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin

[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java

Fabian Hueske-2

Re: Containment Join Support

Hi Martin,

good to hear that you like Flink :-)

AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.

Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian

2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:

Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin

[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java

Stephan Ewen

Re: Containment Join Support

I would rewrite this to replicate the list into tuples:

"foreach x in list: emit (x, list)"

Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <[hidden email]> wrote:

Hi Martin,

good to hear that you like Flink :-)
AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.
Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian

2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:
Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin

[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java

Martin Junghanns

Re: Containment Join Support

Hi Fabian, hi Stephen,

thanks for answering my question. Good hint with the list replication, I will benchmark this vs. cross + filter.

Best, Martin

Am 17.07.2015 um 11:17 schrieb Stephan Ewen:

I would rewrite this to replicate the list into tuples:

"foreach x in list: emit (x, list)"
Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <[hidden email]> wrote:

Hi Martin,

good to hear that you like Flink :-)

AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.

Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian

2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:

Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin

[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java

Vasiliki Kalavri

Re: Containment Join Support

Hi Martin,

I'm really glad to see that you've started using Gelly :)

I think that a graph summarization library method would be a great addition!

Let me know if you need help and if you want to discuss ideas or other methods.

Cheers,

Vasia.

On 17 July 2015 at 12:25, Martin Junghanns <[hidden email]> wrote:

Hi Fabian, hi Stephen,

thanks for answering my question. Good hint with the list replication, I will benchmark this vs. cross + filter.

Best, Martin

Am 17.07.2015 um 11:17 schrieb Stephan Ewen:

I would rewrite this to replicate the list into tuples:

"foreach x in list: emit (x, list)"
Then join on fields 0.

This replicates the lists, but makes the join very efficient.

On Fri, Jul 17, 2015 at 12:26 AM, Fabian Hueske <[hidden email]> wrote:

Hi Martin,

good to hear that you like Flink :-)

AFAIK, there are no plans to add a containment join. The Flink community is currently working on adding support for outer joins.

Regarding a containment join, I am not sure about the number of use cases. I would rather try to implement it on top of Flink's batch API instead of adding it as an internal feature/operator to the system because this would touch a lot of things (API, optimizer, operator implementation).

There might be better ways to implement a containment join than using a cross and a filter.
- Do you know a distributed algorithm for containment joins? Maybe it can be implemented with Flink's API.
- I guess, you are implementing a generic graph framework, but can you make certain assumptions about the data such as relative sizes of the inputs or avg/max size of the lists, etc.?

Contributions to Gelly (and Flink in general) are highly welcome.

Best, Fabian

2015-07-16 9:39 GMT+02:00 Martin Junghanns <[hidden email]>:

Hi everyone,

at first, thanks for building this great framework! We are using Flink
and especially Gelly for building a graph analytics stack (gradoop.com).

I was wondering if there is a [planned] support for a containment join
operator. Consider the following example:

DataSet<List<Int>> left := {[0, 1], [2, 3, 4], [5]}
DataSet<Tuple2<Int, Int>> right := {<0, 1>, <1, 0>, <2, 1>, <5, 2>}

What I want to compute is

left.join(right).where(list).contains(tuple.f0) :=

{
<[0, 1], <0,1>>, <[0, 1], <1, 0>>,
<[2, 3, 4], <2, 1>>,
<[5], <5, 2>
}

At the moment, I am solving that using cross and filter, which can be
expensive.

The generalization of that operator would be "set containment join",
where you join if the right set is contained in the left set.

If there is a general need for that operator, I would also like to
contribute to its implementation.

But maybe, there is already another nice solution which I didn't
discover yet?

Any help would be appreciated. Especially since I would also like to
contribute some of our graph operators (e.g., graph summarization) back
to Flink/Gelly (current WIP state can be found here: [1]).

Thanks,

Martin

[1]
https://github.com/dbs-leipzig/gradoop/blob/%2345_gradoop_flink/gradoop-flink/src/main/java/org/gradoop/model/impl/operators/Summarization.java