merge/union fast and slow streams based on event timestamp

xiatao123
I am running into a problem when processing the past 7 days of data from multiple streams.  I am trying to union the streams based on event timestamp.

The problem is that some streams are significantly bigger than others. For example, one stream has 1,000 events/sec and another has 1,000,000 events/sec.

I am using a PriorityQueue to sort the events by event timestamp. Since the watermarks of the fast (smaller) streams advance much faster than those of the slow (bigger) streams, lots of events from the faster streams end up in the queue waiting for the slower streams to catch up, and the job eventually runs out of memory.
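
To make the setup concrete, here is a minimal sketch of that kind of buffer in plain Java. It is not the actual job code: the class name EventTimeMergeBuffer, the Event type and the method names are all made up for illustration, and it assumes each stream reports its watermark explicitly. Events are only released once every stream's watermark has passed them, so whatever the faster streams deliver ahead of the slowest watermark stays on the heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical buffer that merges several streams in event-time order. */
class EventTimeMergeBuffer {

    /** Hypothetical event: a timestamp plus the index of its source stream. */
    static final class Event {
        final long timestamp;
        final int stream;

        Event(long timestamp, int stream) {
            this.timestamp = timestamp;
            this.stream = stream;
        }
    }

    // Min-heap ordered by event timestamp.
    private final PriorityQueue<Event> queue =
            new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestamp));

    // Latest watermark seen per stream; MIN_VALUE means "nothing reported yet".
    private final long[] watermarks;

    EventTimeMergeBuffer(int numStreams) {
        watermarks = new long[numStreams];
        Arrays.fill(watermarks, Long.MIN_VALUE);
    }

    /** Buffers an event until all streams have progressed past its timestamp. */
    void onEvent(Event e) {
        queue.add(e);
    }

    /** Records a watermark for one stream and returns the events that are now safe to emit. */
    List<Event> onWatermark(int stream, long watermark) {
        watermarks[stream] = Math.max(watermarks[stream], watermark);
        // The slowest stream decides how far the merged output may advance.
        long low = Arrays.stream(watermarks).min().getAsLong();
        List<Event> ready = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().timestamp <= low) {
            ready.add(queue.poll());
        }
        return ready;
    }

    /** Number of events currently held back. */
    int buffered() {
        return queue.size();
    }
}
```

With two streams, if stream 0 delivers events up to timestamp 5 while stream 1 has not reported a watermark, onWatermark(0, 5) returns nothing and all of stream 0's events remain buffered. That is the growth I am seeing, just at a much larger scale.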

Is there any way we can apply back pressure to the fast streams so they slow down? Or some way to coordinate the watermarks between all the streams?

I am planning to use external storage to track the low watermark across all the streams, so we don't read events we cannot handle yet into the PriorityQueue.
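
Roughly what I have in mind, as an in-process sketch only. The class LowWatermarkGate and its methods are hypothetical, and the shared array stands in for the external storage; none of this is a Flink API. Each reader asks the gate before pulling its next event and is held back when that event lies more than maxLead ahead of the slowest stream. The stream that currently holds the low watermark is always allowed to read, so a gap in its data cannot stall everything.

```java
import java.util.Arrays;

/** Hypothetical coordinator that keeps readers within a bounded lead of the slowest stream. */
class LowWatermarkGate {

    // Latest event time each stream has read so far.
    private final long[] progress;
    // How far (in event time) a stream may run ahead of the slowest one.
    private final long maxLead;

    /** startTimestamp is the beginning of the replayed range, e.g. seven days ago. */
    LowWatermarkGate(int numStreams, long startTimestamp, long maxLead) {
        this.progress = new long[numStreams];
        Arrays.fill(this.progress, startTimestamp);
        this.maxLead = maxLead;
    }

    /** Smallest progress across all streams. */
    synchronized long lowWatermark() {
        return Arrays.stream(progress).min().getAsLong();
    }

    /** Returns true if the given stream may read an event with this timestamp now. */
    synchronized boolean mayRead(int stream, long nextTimestamp) {
        long low = lowWatermark();
        // The slowest stream must always be able to move, otherwise nobody can.
        return progress[stream] == low || nextTimestamp <= low + maxLead;
    }

    /** Records that a stream has read an event with this timestamp. */
    synchronized void advance(int stream, long timestamp) {
        progress[stream] = Math.max(progress[stream], timestamp);
    }

    /** Marks a stream as exhausted so it no longer holds the others back. */
    synchronized void finish(int stream) {
        progress[stream] = Long.MAX_VALUE;
    }
}
```

A reader for which mayRead returns false would wait and retry instead of fetching the event, so the PriorityQueue never holds more than about maxLead worth of event time per stream.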

Any better suggestions?

Thanks,
Tao

Re: merge/union fast and slow streams based on event timestamp

Fabian Hueske-2
Hi Tao,

there's no built-in mechanism for this in Flink.
Throttling streams and creating back pressure is not a good idea in general because it prevents checkpoint barriers (which must be aligned with the events) from arriving at the operators.
This might cause checkpoints to time out.

The only way to prevent this is to not emit records from a source.
Once a record is emitted, it should be processed if possible.

Best, Fabian

(In reply to the message from Tao Xia <[hidden email]>, 2018-04-30 23:15 GMT+02:00.)