(DEPRECATED) Apache Flink User Mailing List archive.

Best way to implemented non-windowed join

Classic

List

Threaded

2 messages Options

Yaroslav Tkachenko-2

Best way to implemented non-windowed join

Hello everyone,

I have a question about implementing a join of N datastreams (where N > 2) without any time guarantees. According to my requirements, late data is not tolerable, so if I have a join between stream A and stream B and a message with key X arrives in stream B one year after arriving in stream A, it still should be able to match and emit the result.

I'm currently working on a pipeline with 9 sources and 2 intermediate joins, so the DAG looks like this:

stream 1 ----->
stream 2 ----->
stream 3 -----> join1 ----->
stream 4 ----->
stream 5 ----->
stream 6 -----> join2 ----->
stream 7 ------------------>
stream 8 ------------------>
stream 9 ------------------> final join

I'm approaching this by creating state variables for all inputs and using them for lookups (I'm fine with constantly growing keyspace). In terms of actually joining data streams I see 3 options:

- coGroup with a GlobalWindow and triggers

- connect

- union

coGroup and connect only support two inputs, so implementing the DAG I described above is very cumbersome. Also, coGroup needs windowing semantics (even if it's just a single GlobalWindow), and in my experience with other frameworks, this could add some overhead.

union only works on inputs of the same type, but this can be solved by introducing a wrapper class.

I'm able to get the union approach working, but I'm still not sure if it's the best way to implement this pipeline. Any suggestions?

Thank you!

Timo Walther

Re: Best way to implemented non-windowed join

Hi Yaroslav,

I think your approach is correct. Union is perfect to implement multiway
joins if you normalize the type of all streams before. It can simply be
a composite type with the key and a member variable for each stream
where only one of those variables is not null. A keyed process function
can then perform the actual joining.

Regards,
Timo

On 25.02.21 23:29, Yaroslav Tkachenko wrote:

> Hello everyone,
>
> I have a question about implementing a join of N datastreams (where N >
> 2) without any time guarantees. According to my requirements, late data
> is not tolerable, so if I have a join between stream A and stream B and
> a message with key X arrives in stream B one year after arriving in
> stream A, it still should be able to match and emit the result.
>
> I'm currently working on a pipeline with 9 sources and 2 intermediate
> joins, so the DAG looks like this:
>
> stream 1 ----->
> stream 2 ----->
> stream 3 -----> join1 ----->
> stream 4 ----->
> stream 5 ----->
> stream 6 -----> join2 ----->
> stream 7 ------------------>
> stream 8 ------------------>
> stream 9 ------------------> final join
>
> I'm approaching this by creating state variables for all inputs and
> using them for lookups (I'm fine with constantly growing keyspace). In
> terms of actually joining data streams I see 3 options:
> - coGroup with a GlobalWindow and triggers
> - connect
> - union
>
> coGroup and connect only support two inputs, so implementing the DAG I
> described above is very cumbersome. Also, coGroup needs windowing
> semantics (even if it's just a single GlobalWindow), and in my
> experience with other frameworks, this could add some overhead.
>
> union only works on inputs of the same type, but this can be solved by
> introducing a wrapper class.
>
> I'm able to get the union approach working, but I'm still not sure if
> it's the best way to implement this pipeline. Any suggestions?
>
> Thank you!
>