(DEPRECATED) Apache Flink User Mailing List archive.

Joining more than 2 streams

Classic

List

Threaded

2 messages Options

Gagan Agrawal

Joining more than 2 streams

Hi,

I want to do window join on multiple Kafka streams (say a, b, c) on common field in all 3 streams and apply some custom function on joined stream. As I understand we can join only 2 streams at a time via DataStream api. So may be I need to join a and b first and then join first joined stream with c. I want to understand how would stream state be stored in backend? Since I will be joining a and b stream first, I believe both streams will be stored in state backend for window time. And then again join of first joined stream (of a and b) with c will result storage of all 3 streams for windowed period. Does that mean stream a and b are stored twice in state backend?

Let's say instead of using inbuilt join api, if I rather union all 3 streams (after transforming them to common schema) and keyBy stream on common field and apply process function where I implement joining on my own and store streams in some state backend, will that be more storage efficient as I will be saving 3 streams just once instead of twice?

Gagan

Fabian Hueske-2

Re: Joining more than 2 streams

Hi,

Yes, your reasoning is correct. If you use two binary joins, the data of the first two streams will be buffered twice.

Unioning all three streams and joining them in a custom ProcessFunction would reduce the amount of required state.

Best, Fabian

Am Sa., 24. Nov. 2018 um 14:08 Uhr schrieb Gagan Agrawal <[hidden email]>:

Hi,
I want to do window join on multiple Kafka streams (say a, b, c) on common field in all 3 streams and apply some custom function on joined stream. As I understand we can join only 2 streams at a time via DataStream api. So may be I need to join a and b first and then join first joined stream with c. I want to understand how would stream state be stored in backend? Since I will be joining a and b stream first, I believe both streams will be stored in state backend for window time. And then again join of first joined stream (of a and b) with c will result storage of all 3 streams for windowed period. Does that mean stream a and b are stored twice in state backend?

Let's say instead of using inbuilt join api, if I rather union all 3 streams (after transforming them to common schema) and keyBy stream on common field and apply process function where I implement joining on my own and store streams in some state backend, will that be more storage efficient as I will be saving 3 streams just once instead of twice?

Gagan