Re: how to propagate watermarks across multiple jobs
Posted by
yidan zhao on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/how-to-propagate-watermarks-across-multiple-jobs-tp41813p41968.html
Thank you.
Hey Yidan,
KafkaShuffle is initially motivated to support shuffle data materialization on Kafka, and started with a limited version supporting hash-partition only. Watermark is maintained and forwarded as part of shuffle data. So you are right, watermark storing/forwarding logic has nothing to do with whether the stream is keyed or not. The current approach in KafkaShuffle should also work for non-keyed streams if I remember correclty. So, yes, the logic can be extracted and generalized.
Best,
Yuan
One more question, If I only need watermark's logic, not keyedStream, why not provide methods such as writeDataStream and readDataStream. It uses the similar methods for kafka producer sink records and broadcast watermark to partitions and then kafka consumers read it and regenerate the watermark. I think it will be more general? In this way, the kafka consumer reads the stream from kafka, and can continue to call keyBy to get a keyedStream. I don't know why KafkaShuffle only considers the 'keyedStream' case.
Great :)
Just one more note. Currently FlinkKafkaShuffle has a critical bug [1] that probably will prevent you from using it directly. I hope it will be fixed in some next release. In the meantime you can just inspire your solution with the source code.
Best,
Piotrek
Yes, you are right and thank you. I take a brief look at what FlinkKafkaShuffle is doing, it seems what I need and I will have a try.