We have a Kafka stream of events that we want to process with a Flink DataStream job. However, the stream is populated by an upstream batch process that only executes every few hours, so the stream has very 'bursty' behaviour. We need a window based on event time that waits for the next events for the same key. Because the stream is populated in batches, these windows can remain open (with no event activity on the stream) for many hours. From what I understand, we could indeed leave the Flink DataStream job up and running all this time and the window would remain open. We could even take a savepoint, stop the process, and restart it (with the window state being restored) when we get the next batch and events start appearing in the stream again. One rationale for this mode of operation is that we have a future use case where this stream will be populated in real time and would behave like a normal stream. Is that a best-practice approach for this scenario? Or should we be treating these batches as individual batches (a Flink job that ends when the batch ends) and manually handle the windowing that needs to cross multiple batches? Thanks, Jonny
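For concreteness, a minimal sketch of the kind of job described above, using the 2019-era DataStream API and the universal Kafka connector. The topic name, consumer properties, payload parsing, out-of-orderness bound, and the tumbling 5-minute window are assumptions for illustration, not details taken from this thread.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class BurstyKafkaWindowJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Windows are driven by event time, not by when a batch lands in Kafka.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "bursty-consumer");     // placeholder

        DataStream<String> records = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        records
                // Timestamps come from the record itself; parsing is a stand-in here.
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(30)) {
                            @Override
                            public long extractTimestamp(String record) {
                                return parseEventTimeMillis(record);
                            }
                        })
                .keyBy(BurstyKafkaWindowJob::parseKey)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                // A window only fires once the watermark passes its end, which for this
                // topic happens when the next upstream batch pushes event time forward,
                // possibly hours later in processing time. Until then its state stays open.
                .reduce((a, b) -> a + "," + b)
                .print();

        env.execute("bursty-kafka-windows");
    }

    // Hypothetical parsers; the real payload format is not described in the thread.
    private static long parseEventTimeMillis(String record) {
        return Long.parseLong(record.split(",")[0]);
    }

    private static String parseKey(String record) {
        return record.split(",")[1];
    }
}
```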
|
In Flink you can't read data from Kafka with the DataSet API (batch), and you don't want to mess with starting and stopping your job every few hours. Can you elaborate more on your use case? Are you going to use keyBy? Is there any way to use a trigger?
|
Thanks. The plan is to use the DataStream API (or possibly Beam), which is fine for data from Kafka (and presumably from a file too). Assuming that I don't stop and restart the streaming job but just leave it idle (as no events are coming into Kafka), is there any issue with leaving a window 'open' for a few hours? The window would be defined as 5 minutes of event time, but because the upstream system is buffering the data, Flink would only see the end of those 5 minutes after many hours (of processing time). I don't have specific details of what the Flink job will be doing because I want this architecture to support many use cases of processing the data. I'm trying to determine whether or not it makes sense to model the data as a stream coming via Kafka, despite the Kafka stream being populated in large chunks (which have nothing to do with event time, but rather with the batch-based processing of the upstream system). We would typically be using a keyed sliding window. The watermark of the window would be defined in terms of event time. The event time would not progress for the extended period between the end of one upstream batch and the next, so the windows would remain open for that time. Thanks, Jonny
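To make these details concrete, here is a hedged sketch of a keyed sliding event-time window of the kind described above, with a small stand-in bounded source in place of Kafka. The 5-minute window size matches the description; the 1-minute slide, the 30-second out-of-orderness bound, and the Tuple2 (key, timestamp) payload are assumptions for illustration.

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SlidingEventTimeWindowSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Stand-in source; in the real job this would be the Kafka consumer.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("key-a", 1_000L),
                Tuple2.of("key-a", 65_000L),
                Tuple2.of("key-b", 70_000L));

        events
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(30)) {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> event) {
                                // Watermarks are derived from event timestamps alone, so they
                                // stop advancing once a batch is exhausted; open windows are
                                // simply held until the next batch moves event time forward.
                                return event.f1;
                            }
                        })
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> event) {
                        return event.f0;
                    }
                })
                // 5 minutes of event time as described; the 1-minute slide is assumed.
                .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
                .reduce((a, b) -> Tuple2.of(a.f0, Math.max(a.f1, b.f1)))
                .print();

        env.execute("sliding-event-time-window-sketch");
    }
}
```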
|
Hi Jonny, I think this is a good use case for event-time stream processing. The idea of taking a savepoint, stopping, and later resuming the job is good, as it frees the resources that would otherwise be occupied by the idle job. In that sense it would behave like a batch job. However, in contrast to a batch job, you keep the state of the open windows and can resume exactly where you left off when the job was stopped. Best, Fabian
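As a rough sketch of the job-side settings that make this stop-and-resume pattern comfortable (using the 1.7-era API, a filesystem state backend, and a placeholder checkpoint path, all assumptions): the savepoint itself would be taken externally, for example with the CLI's cancel-with-savepoint (`flink cancel -s <targetDir> <jobId>`) and the job resumed later with `flink run -s <savepointPath> ...`.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DurableWindowStateConfig {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Periodic checkpoints protect the open window state while the job sits idle;
        // the path below is a placeholder and would normally point at durable storage.
        env.enableCheckpointing(60_000L);
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // Trivial placeholder pipeline so the sketch runs on its own; the real job
        // would be the keyed, windowed Kafka pipeline from the earlier messages.
        env.fromElements("a", "b", "c").print();

        env.execute("durable-window-state");
    }
}
```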
|