We have a Kafka stream of events that we want to process with a Flink DataStream job. However, the stream is populated by an upstream batch process that only executes every few hours, so the stream has very 'bursty' behaviour. We need a window based on event time that waits for the next events for the same key. Because the stream is populated in batches, these windows can remain open (with no event activity on the stream) for many hours. From what I understand, we could indeed leave the Flink DataStream job up and running all this time and the window would remain open. We could even take a savepoint, stop the process, and restart it (with the window state being restored) when we get the next batch and events start appearing in the stream again. One rationale for this mode of operation is that we have a future use case where this stream will be populated in real time and would behave like a normal stream. Is that a best-practice approach for this scenario? Or should we be treating these batches as individual batches (a Flink job that ends when the batch ends) and manually handle the windowing that needs to cross multiple batches? Thanks, Jonny
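For concreteness, a minimal sketch of the kind of job described above, using the 2019-era DataStream API and the universal Kafka connector. The topic name, consumer properties, payload parsing, out-of-orderness bound, and the tumbling 5-minute window are assumptions for illustration, not details taken from this thread.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class BurstyKafkaWindowJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Windows are driven by event time, not by when a batch lands in Kafka.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "bursty-consumer");     // placeholder

        DataStream<String> records = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        records
                // Timestamps come from the record itself; parsing is a stand-in here.
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(30)) {
                            @Override
                            public long extractTimestamp(String record) {
                                return parseEventTimeMillis(record);
                            }
                        })
                .keyBy(BurstyKafkaWindowJob::parseKey)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                // A window only fires once the watermark passes its end, which for this
                // topic happens when the next upstream batch pushes event time forward,
                // possibly hours later in processing time. Until then its state stays open.
                .reduce((a, b) -> a + "," + b)
                .print();

        env.execute("bursty-kafka-windows");
    }

    // Hypothetical parsers; the real payload format is not described in the thread.
    private static long parseEventTimeMillis(String record) {
        return Long.parseLong(record.split(",")[0]);
    }

    private static String parseKey(String record) {
        return record.split(",")[1];
    }
}
```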
|
In Flink you can't read data from Kafka with the DataSet API (batch), and you don't want to mess with starting and stopping your job every few hours. Can you elaborate more on your use case? Are you going to use keyBy? Is there any way to use a trigger?
|
Thanks. The plan is to use the DataStream API (or possibly Beam), which is fine for data from Kafka (and presumably from a file too). Assuming that I don't stop and restart the streaming job but just leave it idle (as no events are coming into Kafka), is there any issue with leaving a window 'open' for a few hours? The window would be defined as 5 minutes of event time, but because the upstream system is buffering the data, Flink would only see the end of those 5 minutes after many hours (of processing time). I don't have specific details of what the Flink job will be doing because I want this architecture to support many use cases of processing the data. I'm trying to determine whether or not it makes sense to model the data as a stream coming via Kafka, despite the Kafka stream being populated in large chunks (which have nothing to do with event time, but rather with the batch-based processing of the upstream system). We would typically be using a keyed sliding window. The watermark of the window would be defined in terms of event time. The event time would not progress for the extended period between the end of one upstream batch and the next, so the windows would remain open for that time. Thanks, Jonny
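To make these details concrete, here is a hedged sketch of a keyed sliding event-time window of the kind described above, with a small stand-in bounded source in place of Kafka. The 5-minute window size matches the description; the 1-minute slide, the 30-second out-of-orderness bound, and the Tuple2 (key, timestamp) payload are assumptions for illustration.

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SlidingEventTimeWindowSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Stand-in source; in the real job this would be the Kafka consumer.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("key-a", 1_000L),
                Tuple2.of("key-a", 65_000L),
                Tuple2.of("key-b", 70_000L));

        events
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(30)) {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> event) {
                                // Watermarks are derived from event timestamps alone, so they
                                // stop advancing once a batch is exhausted; open windows are
                                // simply held until the next batch moves event time forward.
                                return event.f1;
                            }
                        })
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> event) {
                        return event.f0;
                    }
                })
                // 5 minutes of event time as described; the 1-minute slide is assumed.
                .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
                .reduce((a, b) -> Tuple2.of(a.f0, Math.max(a.f1, b.f1)))
                .print();

        env.execute("sliding-event-time-window-sketch");
    }
}
```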
|
Hi Jonny, I think this is a good use case for event-time stream processing. The idea of taking a savepoint, stopping, and later resuming the job is good, as it frees the resources that would otherwise be occupied by the idle job. In that sense it would behave like a batch job. However, in contrast to a batch job, you keep the state of the open windows and can resume exactly where you left off when the job was stopped. Best, Fabian
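As a rough sketch of the job-side settings that make this stop-and-resume pattern comfortable (using the 1.7-era API, a filesystem state backend, and a placeholder checkpoint path, all assumptions): the savepoint itself would be taken externally, for example with the CLI's cancel-with-savepoint (`flink cancel -s <targetDir> <jobId>`) and the job resumed later with `flink run -s <savepointPath> ...`.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DurableWindowStateConfig {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Periodic checkpoints protect the open window state while the job sits idle;
        // the path below is a placeholder and would normally point at durable storage.
        env.enableCheckpointing(60_000L);
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // Trivial placeholder pipeline so the sketch runs on its own; the real job
        // would be the keyed, windowed Kafka pipeline from the earlier messages.
        env.fromElements("a", "b", "c").print();

        env.execute("durable-window-state");
    }
}
```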
|