(DEPRECATED) Apache Flink User Mailing List archive.

Synchronize reading from two Kafka Topics of different size

Olle Noren

Synchronize reading from two Kafka Topics of different size

Hi,

We have a Flink job where we are trying to window join two data streams originating from two different Kafka topics, where one topic contains a lot more data per unit of time than the other.

We use event time processing, and this all works fine when running our pipeline live, i.e. data is consumed and processed as soon as it is ingested in Kafka.

The problem occurs when we replay data stored in Kafka: the watermarks of the “larger-stream” lag behind those of the “smaller-stream”, since the smaller stream has less data per time unit and therefore advances faster.

This leads to a large state at the join operation since data from the “smaller-stream” needs to be kept until the corresponding watermarks from the “larger-stream” have passed.
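To make the effect concrete, here is a rough back-of-the-envelope model in plain Python (not Flink code). It assumes both sources read the same number of records per step during replay, that the join can only release data up to the minimum of the two input watermarks, and it ignores window length, out-of-orderness and per-partition watermarks. The function name and all parameters are hypothetical and chosen only for illustration.

```python
def buffered_small_records(ticks, records_per_tick, small_rate, large_rate,
                           large_parallelism=1):
    """Estimate how many small-stream records the join holds after `ticks` read steps.

    small_rate / large_rate: records per second of event time in each topic.
    records_per_tick: records each source instance reads per step during replay.
    large_parallelism: number of parallel readers assumed for the larger topic.
    """
    small_read = ticks * records_per_tick
    large_read = ticks * records_per_tick * large_parallelism

    # Event-time seconds covered by what each source has read so far.
    small_watermark = small_read / small_rate
    large_watermark = large_read / large_rate

    # The join can only make progress up to the slower of the two watermarks.
    join_watermark = min(small_watermark, large_watermark)

    # Small-stream records with timestamps past the join watermark stay in state.
    held_seconds = small_watermark - join_watermark
    return round(held_seconds * small_rate)


# Larger topic carries 100x more records per second of event time.
baseline = buffered_small_records(ticks=10, records_per_tick=1000,
                                  small_rate=1, large_rate=100)
with_more_readers = buffered_small_records(ticks=10, records_per_tick=1000,
                                           small_rate=1, large_rate=100,
                                           large_parallelism=4)
print(baseline, with_more_readers)
```

In this toy model the held state keeps growing with the length of the replay, and giving the larger topic more readers shrinks it only partially unless the read rates are matched to the data rates.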

To avoid a very large state at the join operator, we have tried increasing the parallelism of the consumer of the “larger-stream” so that it keeps up with the “smaller-stream”. This decreases the size of the state to some extent, but it seems like an ugly way to get around the problem and will not work if the sizes of the two Kafka topics change over time.

Is there any way we can synchronize the reading of the Kafka sources based on the watermarks we have in the two streams, i.e. to pause the reading of the “smaller-topic” until the “larger-stream” has caught up? Any other ideas on how to handle this replay scenario?

Thanks in advance

Olle


Olle Noren
Systems Engineer
Fleet Perception for Maintenance

NIRA Dynamics AB
Wallenbergs gata 4
58330 Linköping
Sweden
Mobile: +46 709 748 304
[hidden email]
www.niradynamics.se

Together for smarter safety

Till Rohrmann

Re: Synchronize reading from two Kafka Topics of different size

Hi Olle,

What you are describing is indeed a problem in Flink. The solution would be to synchronize the event time across sources so that a source can throttle down when it realizes that it has advanced too far [1]. At the moment, this feature is in development but not yet finished. I think the best solution right now is what you've actually done: increase the parallelism in order to spread the state load.

[1] https://issues.apache.org/jira/browse/FLINK-10886
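The throttling idea can be illustrated with a small simulation in plain Python. This is only a sketch of the concept, not the design in [1] and not Flink API: the function `replay`, the parameters `batch` and `max_drift`, and the use of the last-read timestamp as a stand-in for a watermark are all assumptions made for the example.

```python
import math


def replay(small_ts, large_ts, batch, max_drift):
    """Replay two timestamp-sorted streams, pausing a source that runs too far ahead.

    Returns the largest event-time gap observed between the last records
    read from the two sources.
    """
    data = {"small": small_ts, "large": large_ts}
    pos = {name: 0 for name in data}
    max_gap = 0

    def watermark(name):
        i = pos[name]
        if i >= len(data[name]):
            return math.inf  # an exhausted source no longer holds the other back
        return data[name][max(i - 1, 0)]  # last timestamp read (first one before any read)

    while any(pos[n] < len(data[n]) for n in data):
        low = min(watermark(n) for n in data)
        for name in data:
            if pos[name] >= len(data[name]):
                continue  # nothing left to read
            if watermark(name) - low > max_drift:
                continue  # too far ahead of the slowest source: skip this round
            pos[name] = min(pos[name] + batch, len(data[name]))
        if all(pos[n] > 0 for n in data):
            last = [data[n][pos[n] - 1] for n in data]
            max_gap = max(max_gap, max(last) - min(last))
    return max_gap


# 1000 seconds of event time: 1 record/s in the small topic, 10 records/s in the large one.
small = list(range(1000))
large = [t // 10 for t in range(10000)]

unaligned = replay(small, large, batch=100, max_drift=math.inf)  # never pauses
aligned = replay(small, large, batch=100, max_drift=50)
print(unaligned, aligned)
```

With pausing enabled, the gap in this model stays bounded by roughly `max_drift` plus the event time covered by one batch of the faster source, whereas without it the gap grows with the length of the replay.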

Cheers,

Till

On Thu, Feb 14, 2019 at 5:37 PM Olle Noren <[hidden email]> wrote:
