Some questions about Join

Some questions about Join

Vinh June
Hello,

I have some questions concerning Join:

1. I would like to make a join with different conditions. Is there any way to create a join with a condition other than "equalTo"? For example, how would I make a join with > or >=?

2. I have a DataSet[Map[String, Any]]. Is it possible to specify a KeySelector using a map key? I tried the Scala code below, but it doesn't work:

Set1.join(Set2).where(_.get("key")).equalTo(_.get("key"))
Re: Some questions about Join

Fabian Hueske-2
Hi,

Non-equi joins are only supported by building the cross product.
This is essentially the nested-loop join strategy that a conventional database system would choose. However, such joins are prohibitively expensive when applied to large data sets.
If you have one small and one large data set, you can do the join by broadcasting the smaller side to a MapFunction (withBroadcastSet() [1]) that has the larger data set as its regular input, and evaluate the join condition in the MapFunction.
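To make the broadcast idea concrete, here is a minimal sketch of the nested-loop logic in plain Scala, outside Flink (the object and method names are my own, not Flink API): the small side is held fully in memory, as a broadcast set would be, and an arbitrary predicate is evaluated for every pair.

```scala
// Hypothetical sketch of the broadcast nested-loop join idea.
// `small` stands in for the broadcasted set (fully in memory);
// `large` stands in for the regular input, processed record by record.
object BroadcastNonEquiJoin {
  def join[A, B](large: Seq[A], small: Seq[B])(pred: (A, B) => Boolean): Seq[(A, B)] =
    for {
      a <- large        // each record of the large (regular) input
      b <- small        // every record of the broadcasted small side
      if pred(a, b)     // arbitrary condition, e.g. >, >=, range checks
    } yield (a, b)

  def main(args: Array[String]): Unit = {
    // pairs (a, b) where the large-side value is strictly greater
    println(join(Seq(5, 10, 20), Seq(7, 15))((a, b) => a > b))
    // prints List((10,7), (20,7), (20,15))
  }
}
```

In a real Flink job the in-memory `small` collection would come from `getRuntimeContext.getBroadcastVariable(...)` inside a rich function; only the predicate-evaluation logic is shown here.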

The problem with the Any key selector is that Flink needs to know the types when the program is optimized, because it generates type-specific serializers. I don't think an Any type works as a join key.
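One possible workaround (my own suggestion, not verified against Flink) is to extract the key with a concrete type before joining, e.g. map each record to a (String, Map[String, Any]) tuple and then join on the typed first field. The extraction step itself is plain Scala:

```scala
// Hypothetical workaround sketch: pull the join key out of the Map with a
// concrete type (here String), so the key type is statically known.
// In a Flink job this would be a map() producing (String, Map[String, Any])
// tuples, followed by where(0).equalTo(0); only the extraction is shown.
object TypedKeyExtraction {
  def extractKey(record: Map[String, Any]): (String, Map[String, Any]) =
    (record("key").asInstanceOf[String], record)  // assumes "key" holds a String

  def main(args: Array[String]): Unit = {
    val rec = Map("key" -> "a", "value" -> 42)
    println(extractKey(rec)._1)  // prints a -- a typed key a serializer can handle
  }
}
```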

Best, Fabian





Re: Some questions about Join

Fabian Hueske-2
If your join predicates are like "x < y" or "x <= y", you do not need to evaluate the full Cartesian product in the Map function.
Instead, you can sort the broadcasted set, or even build an index over it, to reduce the number of predicate evaluations.
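As an illustration of this idea (my own sketch, not Flink API): for an "x < y" predicate, sort the broadcasted side once; then for each large-side x, find the first y greater than x, and everything after that position matches, so most pairs are never tested individually.

```scala
// Hypothetical sketch: exploit a sorted broadcast side for an "x < y" join.
// Instead of testing every (x, y) pair, locate the first y > x and take the
// sorted suffix from there.
object SortedBroadcastJoin {
  def joinLess(large: Seq[Int], smallSorted: IndexedSeq[Int]): Seq[(Int, Int)] =
    large.flatMap { x =>
      // index of the first element strictly greater than x;
      // a binary search could replace this linear scan for large sets
      val idx = smallSorted.indexWhere(_ > x)
      if (idx < 0) Seq.empty
      else smallSorted.drop(idx).map(y => (x, y))  // suffix all satisfies x < y
    }

  def main(args: Array[String]): Unit = {
    val small = Vector(3, 8, 15).sorted  // broadcasted side, sorted once
    println(joinLess(Seq(5, 10), small))
    // prints List((5,8), (5,15), (10,15))
  }
}
```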
