Best way to implemented non-windowed join

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Best way to implemented non-windowed join

Yaroslav Tkachenko-2
Hello everyone,

I have a question about implementing a join of N datastreams (where N > 2) without any time guarantees. According to my requirements, late data is not tolerable, so if I have a join between stream A and stream B and a message with key X arrives in stream B one year after arriving in stream A, it still should be able to match and emit the result. 

I'm currently working on a pipeline with 9 sources and 2 intermediate joins, so the DAG looks like this:

stream 1 ----->
stream 2 ----->
stream 3 -----> join1 ----->
stream 4 ----->
stream 5 ----->
stream 6 -----> join2 ----->
stream 7 ------------------>
stream 8 ------------------>
stream 9 ------------------> final join

I'm approaching this by creating state variables for all inputs and using them for lookups (I'm fine with constantly growing keyspace). In terms of actually joining data streams I see 3 options:
- coGroup with a GlobalWindow and triggers
- connect
- union

coGroup and connect only support two inputs, so implementing the DAG I described above is very cumbersome. Also, coGroup needs windowing semantics (even if it's just a single GlobalWindow), and in my experience with other frameworks, this could add some overhead.

union only works on inputs of the same type, but this can be solved by introducing a wrapper class. 

I'm able to get the union approach working, but I'm still not sure if it's the best way to implement this pipeline. Any suggestions?

Thank you!

Reply | Threaded
Open this post in threaded view
|

Re: Best way to implemented non-windowed join

Timo Walther
Hi Yaroslav,

I think your approach is correct. Union is perfect to implement multiway
joins if you normalize the type of all streams before. It can simply be
a composite type with the key and a member variable for each stream
where only one of those variables is not null. A keyed process function
can then perform the actual joining.

Regards,
Timo


On 25.02.21 23:29, Yaroslav Tkachenko wrote:

> Hello everyone,
>
> I have a question about implementing a join of N datastreams (where N >
> 2) without any time guarantees. According to my requirements, late data
> is not tolerable, so if I have a join between stream A and stream B and
> a message with key X arrives in stream B one year after arriving in
> stream A, it still should be able to match and emit the result.
>
> I'm currently working on a pipeline with 9 sources and 2 intermediate
> joins, so the DAG looks like this:
>
> stream 1 ----->
> stream 2 ----->
> stream 3 -----> join1 ----->
> stream 4 ----->
> stream 5 ----->
> stream 6 -----> join2 ----->
> stream 7 ------------------>
> stream 8 ------------------>
> stream 9 ------------------> final join
>
> I'm approaching this by creating state variables for all inputs and
> using them for lookups (I'm fine with constantly growing keyspace). In
> terms of actually joining data streams I see 3 options:
> - coGroup with a GlobalWindow and triggers
> - connect
> - union
>
> coGroup and connect only support two inputs, so implementing the DAG I
> described above is very cumbersome. Also, coGroup needs windowing
> semantics (even if it's just a single GlobalWindow), and in my
> experience with other frameworks, this could add some overhead.
>
> union only works on inputs of the same type, but this can be solved by
> introducing a wrapper class.
>
> I'm able to get the union approach working, but I'm still not sure if
> it's the best way to implement this pipeline. Any suggestions?
>
> Thank you!
>