I am running into a problem when processing the past 7 days of data from multiple streams. I am trying to union the streams based on event timestamp.
The problem is that there are streams are significant big than other streams. For example if one stream has 1,000 event/sec and the other stream has 1,000,000 event/sec.
I am using a PrirotyQueue to sort the event based on event timestamp. Since the fast(smaller) streams watermarks moves much faster than the slow(bigger) streams, there are lots of events from the faster streams ended up in the Queue waiting for the slower stream to catch up and eventually ran out of memory.
Is there anyway we can send back pressure on the fast streams so they can slow down? or somehow to coordinate the watermarks between all the streams?
I am planning to use an external storage to tracking the low watermarks between all the streams. so we don't read the event we cannot handle into the PriorityQueue.
Any better suggestions?
Thanks,
Tao