NOT IN with Flink

Malte Schwarzer
Hi,

is there an easy way to do a NOT IN, or something like join().where().notEquals(), on two datasets with Flink?

Cheers
Malte

Re: NOT IN with Flink

Stephan Ewen

Hey!

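Careful: In SQL, the semantics of a "not-equal" join are quite different from those of a NOT IN statement. A not-equal join emits a pair for every combination of elements that differ, whereas NOT IN keeps an element only if nothing in the other set matches it.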

Here is how you do the equivalent of NOT IN:

If the list of elements is small and known up front, create a hash set and pass it to a filter function (via a closure or the constructor). The filter function can then look up whether each element is contained in the set and drop it if so.
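A minimal sketch of that first variant, in plain Java so it runs without a cluster. The class and method names (NotInFilterSketch, NotInFilter, notIn) are made up for illustration. In a Flink job, the filter class would additionally implement Flink's FilterFunction<String>, whose single filter(T) method has the same shape, and would be passed to DataSet.filter() instead of a Java stream.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class NotInFilterSketch {

    // Hypothetical filter: keeps a value only if it is NOT in the excluded set.
    // Serializable because Flink ships function objects to the workers.
    public static class NotInFilter implements Serializable {
        private final Set<String> excluded;

        // The set is handed over through the constructor, as described above.
        public NotInFilter(Set<String> excluded) {
            this.excluded = new HashSet<>(excluded);
        }

        // Same shape as FilterFunction<String>.filter(String).
        public boolean filter(String value) {
            return !excluded.contains(value);
        }
    }

    // Stand-in for dataSet.filter(new NotInFilter(excluded)) on a local list.
    public static List<String> notIn(List<String> input, Set<String> excluded) {
        NotInFilter f = new NotInFilter(excluded);
        return input.stream().filter(f::filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d");
        Set<String> excluded = new HashSet<>(Arrays.asList("b", "d"));
        System.out.println(notIn(input, excluded)); // prints [a, c]
    }
}
```

Each lookup is a constant-time hash probe, so the filter stays cheap as long as the set fits comfortably in memory on every worker.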

If the elements are not known up front (for example, because they are themselves a DataSet), use a broadcast variable that you attach to a RichFilterFunction. In the filter function's open() method, grab the broadcast variable and turn it into a hash set. The filter logic is then the same as above.
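A rough sketch of the broadcast-variable variant, assuming the Java DataSet API roughly as it existed around the time of this thread (newer Flink releases have changed or removed parts of it). The class names, the broadcast name "excluded", and the sample data are illustrative assumptions, and the Flink dependency must be on the classpath.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class BroadcastNotInSketch {

    // Name under which the exclusion set is broadcast (illustrative choice).
    static final String EXCLUDED = "excluded";

    // Hypothetical filter that builds its hash set from a broadcast variable.
    public static class NotInBroadcastFilter extends RichFilterFunction<String> {
        private transient Set<String> excluded;

        @Override
        public void open(Configuration parameters) throws Exception {
            // The broadcast data set arrives as a List; copy it into a HashSet once per task.
            List<String> broadcast = getRuntimeContext().getBroadcastVariable(EXCLUDED);
            excluded = new HashSet<>(broadcast);
        }

        @Override
        public boolean filter(String value) {
            // Keep only values that are NOT in the broadcast set.
            return !excluded.contains(value);
        }
    }

    // Attaches the second data set to the filter as a broadcast set.
    public static DataSet<String> notIn(DataSet<String> input, DataSet<String> excludedSet) {
        return input.filter(new NotInBroadcastFilter()).withBroadcastSet(excludedSet, EXCLUDED);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> input = env.fromElements("a", "b", "c", "d");
        DataSet<String> excluded = env.fromElements("b", "d");
        // Depending on the Flink version, print() either triggers execution
        // itself or needs a following env.execute().
        notIn(input, excluded).print();
    }
}
```

The broadcast set is replicated to every parallel task, so this only works well when the excluded elements fit in memory on each worker.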

Check out the API guides for some examples of how to use broadcast variables.

Stephan

On 11.12.2014 12:17, "Malte Schwarzer" <[hidden email]> wrote:
Hi,

is there an easy way to do a NOT IN, or something like join().where().notEquals(), on two datasets with Flink?

Cheers
Malte