I have a job with multiple Kafka sources. They all contain certain historical data.


hao kong
Hello, I have a job with multiple Kafka sources, and they all contain some historical data. If I use an event-time window, the sources with less historical data advance their watermarks far ahead of the sources with more data. Is there a solution?

Fwd: I have a job with multiple Kafka sources. They all contain certain historical data.

hao kong
Hello guys, 

I have a job with multiple Kafka sources, and they all contain some historical data. If I use an event-time window, the sources with less historical data advance their watermarks far ahead of the sources with more data.

One solution I can think of is to implement a scheduler in the source phase, but that would be quite complicated to build. Are there other, better solutions?

Any suggestions?
Thanks!
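
For concreteness, a minimal sketch of the kind of job being described: two Kafka sources holding different amounts of historical data, unioned into a single event-time window. The topic names, record format and timestamp parsing are assumptions made up for illustration.

import java.time.Duration;
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class MultiSourceEventTimeJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "multi-source-demo");

        // Hypothetical topics: "events-small" holds little historical data,
        // "events-large" holds a lot and takes much longer to replay.
        DataStream<String> small =
                env.addSource(new FlinkKafkaConsumer<>("events-small", new SimpleStringSchema(), props));
        DataStream<String> large =
                env.addSource(new FlinkKafkaConsumer<>("events-large", new SimpleStringSchema(), props));

        // Assumed record format "<epochMillis>,<key>", just to extract an event timestamp.
        WatermarkStrategy<String> watermarks = WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((record, ts) -> Long.parseLong(record.split(",", 2)[0]));

        DataStream<String> all = small.assignTimestampsAndWatermarks(watermarks)
                .union(large.assignTimestampsAndWatermarks(watermarks));

        // The window operator advances on the MINIMUM watermark of its inputs. While
        // "events-large" is still replaying history its watermark lags far behind, so
        // windows cannot fire and records from "events-small" keep piling up in state.
        all.keyBy(record -> record.split(",", 2)[1])
           .window(TumblingEventTimeWindows.of(Time.minutes(1)))
           .reduce((a, b) -> a) // placeholder aggregation
           .print();

        env.execute("multi-source event-time job");
    }
}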


Re: I have a job with multiple Kafka sources. They all contain certain historical data.

Piotr Nowojski-4
Hey,

If you are worried about the increased amount of data buffered by the WindowOperator when watermarks/event time do not progress uniformly across multiple sources, then there is little you can do currently. FLIP-27 [1] will allow us to address this problem in a more generic way. What you can currently do is one of two things:

1. Implement a custom throttling function/operator sitting right after the sources. If you chain it with the source function, it's a relatively OK solution. Note that while you are blocking execution, you are also blocking, for example, checkpoints from happening, so it's better to sleep 10 ms per record than to sleep 10 seconds once every 1000 records. (A rough sketch follows this message.)
2. Throttle the sources themselves (you would need to modify them or write your own custom sources).

In both cases you need to manually track the event time and manually decide which source should be throttled and by how much.

Best regards, Piotrek
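
A minimal sketch of what the first option could look like: a map function chained directly after one source, sleeping a few milliseconds per record whenever that source's event time runs too far ahead of the slowest source. How that minimum is obtained is deliberately left behind an interface, since, as noted above, tracking event-time progress across sources has to be done manually; all names and thresholds below are made up.

import org.apache.flink.api.common.functions.MapFunction;

// Sketch only: a throttling map function meant to be chained directly after one source.
// It sleeps a few milliseconds per record whenever this source's event time runs too far
// ahead of the slowest source. Short per-record sleeps keep checkpoint barriers flowing,
// unlike one long sleep every N records.
public class EventTimeThrottler<T> implements MapFunction<T, T> {

    /** Extracts the event timestamp from a record; implementation is job-specific. */
    public interface TimestampExtractor<T> extends java.io.Serializable {
        long extractTimestamp(T record);
    }

    /** Supplies the event-time progress of the slowest source; how it is tracked is up to the job. */
    public interface MinEventTimeProvider extends java.io.Serializable {
        long currentMinEventTime();
    }

    private final TimestampExtractor<T> timestamps;
    private final MinEventTimeProvider minEventTime;
    private final long maxAheadMillis;       // how far ahead of the slowest source we tolerate
    private final long sleepMillisPerRecord; // e.g. 10 ms, as suggested above

    public EventTimeThrottler(TimestampExtractor<T> timestamps,
                              MinEventTimeProvider minEventTime,
                              long maxAheadMillis,
                              long sleepMillisPerRecord) {
        this.timestamps = timestamps;
        this.minEventTime = minEventTime;
        this.maxAheadMillis = maxAheadMillis;
        this.sleepMillisPerRecord = sleepMillisPerRecord;
    }

    @Override
    public T map(T record) throws Exception {
        // Only slow down, never drop: a small sleep per record while we are too far ahead.
        if (timestamps.extractTimestamp(record) > minEventTime.currentMinEventTime() + maxAheadMillis) {
            Thread.sleep(sleepMillisPerRecord);
        }
        return record;
    }
}

Chained right after the source, e.g. env.addSource(consumer).map(new EventTimeThrottler<>(...)), the throttler runs in the same task as the source, so slowing it down effectively slows the source itself.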




Re: I have a job with multiple Kafka sources. They all contain certain historical data.

hao kong
Thanks for the tip!
     I am currently trying to implement a ZooKeeper-based coordinator, using it to record the current watermarks and control the streams according to your first suggestion.
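
A hypothetical sketch of that idea, using Apache Curator: each throttling subtask periodically publishes the event time it has reached under a znode, reads the minimum over all published entries, and sleeps a few milliseconds per record while it is ahead of that minimum. The znode layout, paths, intervals and thresholds are invented for illustration, and a real coordinator would also need to handle restarts, stale entries and subtasks that have not registered yet.

import java.nio.charset.StandardCharsets;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Hypothetical ZooKeeper-coordinated throttler: publishes this subtask's event-time
// progress under /flink-watermarks/<source>/<subtask> and throttles while it is ahead
// of the minimum progress of all registered subtasks.
public class ZkCoordinatedThrottler<T> extends RichMapFunction<T, T> {

    public interface TimestampExtractor<T> extends java.io.Serializable {
        long extractTimestamp(T record);
    }

    private static final String ROOT = "/flink-watermarks";

    private final String zkConnect;
    private final String sourceName;
    private final TimestampExtractor<T> timestamps;
    private final long maxAheadMillis;
    private final long sleepMillisPerRecord;
    private final long refreshIntervalMillis;

    private transient CuratorFramework client;
    private transient String myPath;
    private transient long globalMin;
    private transient long lastRefresh;

    public ZkCoordinatedThrottler(String zkConnect, String sourceName, TimestampExtractor<T> timestamps,
                                  long maxAheadMillis, long sleepMillisPerRecord, long refreshIntervalMillis) {
        this.zkConnect = zkConnect;
        this.sourceName = sourceName;
        this.timestamps = timestamps;
        this.maxAheadMillis = maxAheadMillis;
        this.sleepMillisPerRecord = sleepMillisPerRecord;
        this.refreshIntervalMillis = refreshIntervalMillis;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        client = CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
        myPath = ROOT + "/" + sourceName + "/" + getRuntimeContext().getIndexOfThisSubtask();
        globalMin = Long.MIN_VALUE;
    }

    @Override
    public T map(T record) throws Exception {
        long eventTime = timestamps.extractTimestamp(record);

        // Publish our own progress and re-read the global minimum only every
        // refreshIntervalMillis, so ZooKeeper is not hit on every record.
        long now = System.currentTimeMillis();
        if (now - lastRefresh > refreshIntervalMillis) {
            byte[] data = Long.toString(eventTime).getBytes(StandardCharsets.UTF_8);
            if (client.checkExists().forPath(myPath) == null) {
                client.create().creatingParentsIfNeeded().forPath(myPath, data);
            } else {
                client.setData().forPath(myPath, data);
            }
            globalMin = readGlobalMinimum();
            lastRefresh = now;
        }

        // Throttle gently (a few ms per record) while this source is ahead of the slowest one.
        if (globalMin != Long.MIN_VALUE && eventTime > globalMin + maxAheadMillis) {
            Thread.sleep(sleepMillisPerRecord);
        }
        return record;
    }

    private long readGlobalMinimum() throws Exception {
        long min = Long.MAX_VALUE;
        for (String source : client.getChildren().forPath(ROOT)) {
            for (String subtask : client.getChildren().forPath(ROOT + "/" + source)) {
                byte[] data = client.getData().forPath(ROOT + "/" + source + "/" + subtask);
                min = Math.min(min, Long.parseLong(new String(data, StandardCharsets.UTF_8)));
            }
        }
        return min == Long.MAX_VALUE ? Long.MIN_VALUE : min;
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            client.close();
        }
    }
}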



Re: I have a job with multiple Kafka sources. They all contain certain historical data.

Piotr Nowojski-4
Great, thanks for the update! And please share your feedback if it worked or not.

Piotrek
