Where should a secondary flow for late events processing be defined?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Where should a secondary flow for late events processing be defined?

Jose Velasco
Hi everybody!

I'm so excited to be here asking a first question about Flink DataStream API.

I have a basic enrichment pipeline (event time). Basically, there's a main stream A (Kafka source) being enriched with the info of 2 other streams: B and C. (Kafka sources as well).
Basically, the enrichment graph consists in 2 stages:

1. The stream A is enriched with stream B, resulting in stream (A, B)
2. The stream (A, B) is enriched with C, resulting in a stream (A, B, C).


I've created a side output for late events A. The flow for these events would be slightly different:

1. The stream A is enriched by fetching the info from an external service using Flin Async I/O, resulting in stream (A, B)
2. Then, the stream (A, B) is enriched with C, resulting in a stream (A, B, C) (same before)


Note that late A events can arrive out of order (weeks in same cases)


I was wondering where the late-events flow should be defined. One graph/job for both flows or 2 graph/jobs with different times? What's the general pattern for cases like this?  



Many thanks, Jose Velasco
Reply | Threaded
Open this post in threaded view
|

Re: Where should a secondary flow for late events processing be defined?

Guowei Ma
Hi, Jose

What I understand your question is
    Your job has two stages. You want to handle the first stage differently according to the event time of the Stream A. It means that if the event time of Stream A is “too late” then you would enrich Stream A with the external system and or you would enrich Stream A with the Stream B. But the second stage is the same. You want to know the best practice for this situation.

    I don’t know whether there is a best practice. :-) (Maybe other guys know) But Personally, I prefer using one graph to handle this. From a cost perspective the second stage could be reused(both logical and runtime resource).
Best,
Guowei


On Mon, Jan 25, 2021 at 4:18 AM Jose Velasco <[hidden email]> wrote:
Hi everybody!

I'm so excited to be here asking a first question about Flink DataStream API.

I have a basic enrichment pipeline (event time). Basically, there's a main stream A (Kafka source) being enriched with the info of 2 other streams: B and C. (Kafka sources as well).
Basically, the enrichment graph consists in 2 stages:

1. The stream A is enriched with stream B, resulting in stream (A, B)
2. The stream (A, B) is enriched with C, resulting in a stream (A, B, C).


I've created a side output for late events A. The flow for these events would be slightly different:

1. The stream A is enriched by fetching the info from an external service using Flin Async I/O, resulting in stream (A, B)
2. Then, the stream (A, B) is enriched with C, resulting in a stream (A, B, C) (same before)


Note that late A events can arrive out of order (weeks in same cases)


I was wondering where the late-events flow should be defined. One graph/job for both flows or 2 graph/jobs with different times? What's the general pattern for cases like this?  



Many thanks, Jose Velasco