seeding a stream job

Jared Stehler
I have a use case where I need to start a stream job replaying historical data and then have it continue processing from a live Kafka source, and I'm looking for guidance / best practices for the implementation.

Basically, I want to start up a new “version” of the stream job and have it process each element from a specified range of historical data before continuing with records from the live source. I'd then let the job catch up to current time, atomically switch to it as the “live/current” job, and shut down the previously running one. The issue I'm struggling with is the switch-over from the historical source to the live source. Is there a way to build a composite stream source which would emit records from a bounded data set before consuming from a Kafka topic, for example? Or would I be better off stopping the job once it has read through the historical set, switching its source to the live topic, and restarting it? Some of our jobs rely on rolling fold state, so I think I'd need to resume from a savepoint taken at the end of the historical processing.
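
To make the composite-source idea concrete, here is a minimal sketch on Flink's SourceFunction API. It assumes the historical range can be read through some iterator; HistoricalStore.readRange() is a hypothetical reader over the archived data, and the plain Kafka consumer in phase 2 does not checkpoint its offsets through Flink the way the FlinkKafkaConsumer does, so this illustrates the switch-over rather than a production implementation:

import java.util.Collections;
import java.util.Properties;

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class HistoricalThenLiveSource implements SourceFunction<String> {

    private final Properties kafkaProps;
    private final String topic;
    private volatile boolean running = true;

    public HistoricalThenLiveSource(Properties kafkaProps, String topic) {
        this.kafkaProps = kafkaProps;
        this.topic = topic;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Phase 1: replay the bounded historical range.
        // HistoricalStore.readRange() is a hypothetical reader over the archive.
        for (String record : HistoricalStore.readRange()) {
            if (!running) {
                return;
            }
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
            }
        }

        // Phase 2: the historical set is exhausted; switch to the live topic.
        // A plain kafka-clients consumer is used here for brevity; unlike the
        // FlinkKafkaConsumer, it does not store its offsets in Flink state.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(kafkaProps)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (running) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> r : records) {
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(r.value());
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

The switch point is unambiguous in this sketch because the historical range is bounded: phase 2 starts exactly when the iterator is exhausted.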




Re: seeding a stream job

Fabian Hueske-2
Hi Jared,

I think both approaches should work. The source that integrates the finite batch input with the live stream might be the more convenient one to use.
As you said, the challenge would be to identify the exact point at which to switch from one input to the other.
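
For the stop-and-restart alternative, the rough flow with the Flink CLI (1.2+) would be to cancel the replay job with a savepoint and then resubmit the job, reconfigured with the live Kafka source, from that savepoint. The job id, savepoint directory, and jar name below are placeholders:

# Cancel the historical-replay job and take a savepoint in one step
bin/flink cancel -s hdfs:///flink/savepoints <jobId>

# Resubmit the job, now configured with the live Kafka source,
# resuming its state from the savepoint
bin/flink run -s hdfs:///flink/savepoints/savepoint-<id> your-job.jar

Resuming from the savepoint is what carries state such as the rolling fold state over from the historical run into the live run.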

One thing to consider when reading finite batch data into a streaming job is to ensure that the data is read in event-time order and that all parallel sources make approximately the same progress. Otherwise, data might be dropped as late, or Flink has to buffer a lot of data.
AFAIK, there is not much tooling available for these use cases yet.
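
As a sketch of the event-time side: inside the job definition you can assign timestamps and watermarks to the replayed stream explicitly, so that downstream event-time operators treat historical records the same way as live ones. Here historicalStream, MyEvent, and its getEventTime() accessor are hypothetical, and the out-of-orderness bound depends on your data:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<MyEvent> withTimestamps = historicalStream
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MyEvent>(Time.minutes(10)) {
            @Override
            public long extractTimestamp(MyEvent event) {
                // event time in epoch millis; hypothetical accessor
                return event.getEventTime();
            }
        });

Keeping the parallel source instances at roughly the same position in event time still has to be handled by the source itself, e.g. by splitting the historical range across subtasks by time rather than arbitrarily.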

Best, Fabian

