Hi there,

I understand Flink currently doesn't support handling late-arriving events. In practice, an exactly-once Flink job needs to deal with missing data or backfills from time to time without rewinding to a previous savepoint, since a restored job suffers a blackout while it tries to catch up. In general, I don't know if there are good solutions for all of these scenarios. Some might keep messages in windows longer (messages not purged yet); some might kick off another pipeline just to deal with the affected windows (messages already purged). What would be the suggested patterns?

For the keep-messages-longer approach, we need to coordinate backfill messages with the current ongoing messages, so that the windows they are assigned to don't fire without all of these messages. One challenge of this approach is determining when all backfill messages have been processed. Ideally there would be a customized barrier that travels through the entire topology and tells windows that backfills are done. This works for both non-keyed and keyed streams. I don't think Flink supports this yet.

Another way would be to use session window merging and extend the window purge time based on some reasonable estimation. This approach relies on that estimation and may add execution latency to those windows.

Which would be the suggested way in general?

Thanks,
Chen
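For context, the "keep messages in windows longer" idea can be sketched as a toy event-time tumbling window that fires when the watermark passes the window end, but retains its state for an extra lateness period so late or backfilled events can still produce updated firings. This is a hand-rolled simulation, not Flink's API; the window size, lateness value, and names are all made up for illustration:

```python
# Toy event-time tumbling windows with an extra "allowed lateness" period.
# NOT Flink code: a sketch of the semantics under discussion.

WINDOW_SIZE = 10       # each window covers [start, start + 10)
ALLOWED_LATENESS = 5   # purge state only once watermark >= end + 5

class TumblingWindows:
    def __init__(self):
        self.windows = {}              # window start -> buffered values
        self.already_fired = set()
        self.watermark = float("-inf")
        self.fired = []                # (window_start, values) emissions

    def on_event(self, ts, value):
        start = (ts // WINDOW_SIZE) * WINDOW_SIZE
        if self.watermark >= start + WINDOW_SIZE + ALLOWED_LATENESS:
            return "dropped"           # state already purged
        self.windows.setdefault(start, []).append(value)
        if self.watermark >= start + WINDOW_SIZE:
            # Window fired already; emit an updated (late) result.
            self.fired.append((start, list(self.windows[start])))
            return "late-firing"
        return "buffered"

    def advance_watermark(self, wm):
        self.watermark = max(self.watermark, wm)
        # On-time firing once the watermark passes the window end.
        for start in sorted(self.windows):
            if start + WINDOW_SIZE <= self.watermark and start not in self.already_fired:
                self.fired.append((start, list(self.windows[start])))
                self.already_fired.add(start)
        # Purge state only after the extra lateness period has elapsed.
        for start in list(self.windows):
            if start + WINDOW_SIZE + ALLOWED_LATENESS <= self.watermark:
                del self.windows[start]

w = TumblingWindows()
w.on_event(1, "a"); w.on_event(9, "b")
w.advance_watermark(12)                        # window [0, 10) fires on time
assert w.on_event(8, "late") == "late-firing"  # state kept: updated result
w.advance_watermark(20)                        # past end + lateness: purged
assert w.on_event(8, "x") == "dropped"
```

The trade-off Chen raises is visible here: state for every open window is held for the full lateness period, and any event arriving after the purge is lost on this path.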
I put a few comments in-line below...
On Tue, Jul 5, 2016 at 4:06 PM, Chen Qin <[hidden email]> wrote:
Another way to do this is to kick off a parallel job to do the backfill from the previous savepoint without stopping the current "realtime" job. This way you would not have to have a "blackout". This assumes your final sink can tolerate having some parallel writes to it OR you have two different sinks and throw a switch from one to another for downstream jobs, etc.
Of course, ideally you would just keep data in windows longer such that you don't purge window state until you're sure there is no more data coming. The problem with this approach in the real world is that you may be wrong with whatever time you choose ;) I would suggest doing the best job possible upfront by using an appropriate watermark strategy to deal with most of the data. Then process the truly late data with a separate path in the application code. This "separate" path may have to deal with merging late data with the data that's already been written to the sink but this is definitely possible depending on the sink.
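The "separate path for truly late data" suggestion can be sketched as a router that diverts events older than the watermark minus some bound onto a late-data path, instead of dropping them. Again, this is an illustrative simulation rather than Flink's side-output mechanism; the function name and the watermark approximation (max timestamp seen so far) are assumptions:

```python
# Sketch of routing "truly late" events to a separate path for merge-with-sink
# handling, instead of dropping them. Not Flink's API.

def route(events, lateness_bound):
    """Split (timestamp, value) events into a main path and a late path.

    The watermark is approximated here as the maximum timestamp seen so far;
    a real job would use its configured watermark strategy.
    """
    main, late = [], []
    watermark = float("-inf")
    for ts, value in events:
        if ts < watermark - lateness_bound:
            late.append((ts, value))   # truly late: merge with the sink later
        else:
            main.append((ts, value))
        watermark = max(watermark, ts)
    return main, late
```

For example, with a lateness bound of 5, an event at timestamp 2 arriving after the watermark has reached 10 lands on the late path, where the application code can merge it with what the sink already holds.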
Jamie, sorry for the late reply; some of my thoughts inline. -Chen
Sounds great to me. I think it will solve the "blackout" issue I mentioned. The sink might need to work in a read-check-write fashion, but that should be fine.
Makes sense. Truly late events should go through a side job that merges them with whatever has been written to the sink. That also implies both sinks must be able to do read-check-merge. E.g., consider a job counting search keywords from the beginning: an outage takes down some hosts (partitioned by keyword) for a couple of days. A backfill job then starts loading the missing data and adding counts; after it has backfilled all missing keywords and merged the aggregation results, it may need to write into windows that haven't been emitted yet and let the main job pick up and merge the results.
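The read-check-merge sink behaviour described above can be sketched with the sink modeled as a plain dict (a real sink would be a key-value store or database, and the merge function here is simple addition of counts, an assumption matching the keyword-count example):

```python
# Sketch of a read-check-merge write: before storing a backfilled keyword
# count, read what the realtime job already wrote and merge (add) rather
# than blindly overwrite. The dict stands in for a real KV sink.

def merge_backfill(sink, backfill_counts):
    for keyword, count in backfill_counts.items():
        existing = sink.get(keyword)          # read
        if existing is None:                  # check
            sink[keyword] = count
        else:
            sink[keyword] = existing + count  # merge
    return sink
```

So if the realtime job already wrote `{"flink": 3}` and the backfill job computed `{"flink": 2, "kafka": 4}`, the merged sink holds `{"flink": 5, "kafka": 4}` instead of losing either side's counts.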
Hi Chen,
Just to add to the previous discussion: we are currently discussing possible improvements to the windowing mechanism/semantics, both in a design doc and in a related discussion on the Flink dev mailing list, in the thread titled "[DISCUSS] Allowed Lateness in Flink". In the doc you can also see that allowed lateness is already in the master branch and will be part of the upcoming release. Any input to the discussion is more than welcome, so feel free to jump in.

Kostas