Hello everyone,
I have a question about implementing a join of N datastreams (where N > 2) without any time guarantees. According to my requirements, late data cannot be dropped, so if I have a join between stream A and stream B and a message with key X arrives in stream B one year after arriving in stream A, it should still be able to match and emit the result.

I'm currently working on a pipeline with 9 sources and 2 intermediate joins, so the DAG looks like this:

stream 1 ----->
stream 2 ----->
stream 3 -----> join1 ----->
stream 4 ----->
stream 5 ----->
stream 6 -----> join2 ----->
stream 7 ------------------>
stream 8 ------------------>
stream 9 ------------------> final join

I'm approaching this by creating state variables for all inputs and using them for lookups (I'm fine with a constantly growing keyspace). In terms of actually joining the data streams, I see 3 options:
- coGroup with a GlobalWindow and triggers
- connect
- union

coGroup and connect only support two inputs, so implementing the DAG I described above is very cumbersome. Also, coGroup needs windowing semantics (even if it's just a single GlobalWindow), and in my experience with other frameworks, this could add some overhead.

union only works on inputs of the same type, but this can be solved by introducing a wrapper class.

I'm able to get the union approach working, but I'm still not sure if it's the best way to implement this pipeline. Any suggestions?

Thank you!
Hi Yaroslav,
I think your approach is correct. Union is perfect for implementing multiway joins if you normalize the type of all streams beforehand. The normalized type can simply be a composite type with the key and a member variable for each stream, where only one of those variables is not null. A keyed process function can then perform the actual joining.

Regards,
Timo

On 25.02.21 23:29, Yaroslav Tkachenko wrote:
> Hello everyone,
>
> I have a question about implementing a join of N datastreams (where N > 2) without any time guarantees.
> [...]
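To make the composite-type idea concrete, here is a minimal sketch in plain Java of what the joining logic might look like for three inputs. It deliberately does not use the Flink API: the HashMap stands in for the per-key state a keyed process function would hold, and every name (UnionJoinSketch, Tagged, Joined, Joiner, fromA/fromB/fromC) is hypothetical and only for illustration. It assumes String keys and payloads, keeps the latest value per stream, and never clears state, which matches the "constantly growing keyspace" assumption above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Framework-free sketch of a union-based three-way join (all names are hypothetical). */
public class UnionJoinSketch {

    /** Composite wrapper: the key plus one field per input stream; exactly one field is non-null. */
    static final class Tagged {
        final String key;
        final String a; // payload from stream A, or null
        final String b; // payload from stream B, or null
        final String c; // payload from stream C, or null

        private Tagged(String key, String a, String b, String c) {
            this.key = key; this.a = a; this.b = b; this.c = c;
        }
        static Tagged fromA(String key, String v) { return new Tagged(key, v, null, null); }
        static Tagged fromB(String key, String v) { return new Tagged(key, null, v, null); }
        static Tagged fromC(String key, String v) { return new Tagged(key, null, null, v); }
    }

    /** Joined record carrying one value from each stream. */
    static final class Joined {
        final String key, a, b, c;
        Joined(String key, String a, String b, String c) {
            this.key = key; this.a = a; this.b = b; this.c = c;
        }
        @Override public String toString() { return key + ": " + a + ", " + b + ", " + c; }
    }

    /** Stands in for a keyed process function; the map plays the role of per-key state. */
    static final class Joiner {
        private final Map<String, String[]> state = new HashMap<>();

        /** Stores the element's payload and emits a result once all three sides are known. */
        Optional<Joined> process(Tagged e) {
            String[] slots = state.computeIfAbsent(e.key, k -> new String[3]);
            if (e.a != null) slots[0] = e.a;
            if (e.b != null) slots[1] = e.b;
            if (e.c != null) slots[2] = e.c;
            if (slots[0] != null && slots[1] != null && slots[2] != null) {
                // State is kept, so a later update for this key emits again.
                return Optional.of(new Joined(e.key, slots[0], slots[1], slots[2]));
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Joiner joiner = new Joiner();
        // The "unioned" stream: elements from A, B and C interleaved in arrival order.
        List<Tagged> unioned = List.of(
                Tagged.fromA("X", "a1"),
                Tagged.fromC("X", "c1"),
                Tagged.fromB("Y", "b9"),
                Tagged.fromB("X", "b1"));
        for (Tagged e : unioned) {
            joiner.process(e).ifPresent(System.out::println); // prints "X: a1, b1, c1" once
        }
    }
}
```

In an actual Flink job, the three inputs would each be mapped to the wrapper type, unioned, keyed by the key field, and the slots would live in keyed state inside the process function rather than in a HashMap. Whether to emit on every update or only on the first complete match is a design choice this sketch does not settle.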