I have a use case where I need to start a stream by replaying historical data and then have it continue processing from a live Kafka source, and I'm looking for guidance / best practices for implementation.
Basically, I want to start up a new “version” of the streaming job and have it process each element from a specified range of historical data before continuing to process records from the live source. I'd then let the job catch up to current time, atomically switch to it as the “live/current” job, and shut down the previously running one.

The issue I'm struggling with is the switch-over from the historical source to the live source. Is there a way to build a composite stream source which would emit records from a bounded data set before consuming from a Kafka topic, for example? Or would I be better off stopping the job once it has read through the historical set, switching its source to the live topic, and restarting it? Some of our jobs rely on rolling fold state, so I think I need to resume from the savepoint of the historical processing.
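One way to build such a composite source in the 1.x DataStream API is a custom SourceFunction that drains the bounded set first and then polls Kafka directly. A minimal sketch, assuming hypothetical names, a simple in-memory stand-in for the historical data, and the plain kafka-clients consumer rather than a wrapped FlinkKafkaConsumer (so Kafka offsets are not checkpointed by Flink; `poll(Duration)` assumes kafka-clients 2.0+):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical composite source: emits a bounded historical data set, then
// switches to polling the live Kafka topic. Because it bypasses
// FlinkKafkaConsumer, Kafka offsets are NOT part of Flink's checkpoints --
// this illustrates the switch-over, not a production-grade source.
public class HistoricalThenKafkaSource implements SourceFunction<String> {

    private final List<String> historicalRecords; // stand-in for the bounded set
    private final Properties kafkaProps;          // needs bootstrap.servers, group.id, deserializers
    private final String topic;
    private volatile boolean running = true;

    public HistoricalThenKafkaSource(List<String> historicalRecords,
                                     Properties kafkaProps, String topic) {
        this.historicalRecords = historicalRecords;
        this.kafkaProps = kafkaProps;
        this.topic = topic;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Phase 1: replay the bounded historical range.
        for (String record : historicalRecords) {
            if (!running) {
                return;
            }
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
            }
        }
        // Phase 2: continue with the live topic.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaProps)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (running) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> r : records) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(r.value());
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```

Wrapping the real FlinkKafkaConsumer inside another source is awkward, since it is a rich function with its own checkpoint hooks; giving up its offset checkpointing is the main price of a hand-rolled composite source. (Much later Flink releases, 1.14 onwards, added a built-in HybridSource for exactly this bounded-then-unbounded pattern.)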
Hi Jared,

I think both approaches should work. The source that integrates the finite batch input and the stream might be more comfortable to use. Either way, you should make sure that the historical and the live data make approximately the same progress in event time; otherwise, data might be dropped as late or Flink has to buffer a lot of data. AFAIK, there is not much tooling for these use cases available yet.

Best, Fabian
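For the stop-and-restart variant, the operational side is the standard savepoint workflow: once the historical set is read through, take a savepoint with `flink savepoint <jobId>` (or `flink cancel -s <targetDir> <jobId>`), then resubmit the job with the live source via `flink run -s <savepointPath> -n job.jar`; the `-n` / `--allowNonRestoredState` flag lets the restore skip the historical source's state, which has no counterpart in the new topology. What lets the rolling fold state survive the swap is pinning stable uids on the stateful operators. A sketch under those assumptions (the job skeleton, the `--live` flag, and the fold logic are hypothetical):

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

// Hypothetical job skeleton for the stop/switch/restart approach: the source
// is swapped between runs, while the stateful fold keeps a stable uid so its
// rolling state can be restored from the historical run's savepoint.
public class SwitchoverJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        boolean live = args.length > 0 && "--live".equals(args[0]); // hypothetical flag

        DataStream<String> events;
        if (live) {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");
            props.setProperty("group.id", "switchover-job");
            // Configure the consumer's start position so it lines up with the
            // end of the replayed historical range.
            events = env
                    .addSource(new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), props))
                    .uid("kafka-source");       // distinct uid: its state starts fresh
        } else {
            events = env
                    .fromCollection(Arrays.asList("k1,a", "k2,b", "k1,c")) // stand-in for the historical set
                    .uid("historical-source");
        }

        events
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) {
                        return value.split(",")[0];
                    }
                })
                .fold("", new FoldFunction<String, String>() {
                    @Override
                    public String fold(String acc, String value) {
                        return acc + "|" + value; // placeholder rolling fold
                    }
                })
                .uid("rolling-fold")            // same uid in both runs -> state restores
                .print();

        env.execute("switchover-job");
    }
}
```

If the two sources instead shared one uid, Flink would try to restore the historical source's operator state into the Kafka consumer; giving them distinct uids and allowing non-restored state avoids exactly that mismatch.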