Where should a secondary flow for late events processing be defined?

Posted by Jose Velasco on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Where-should-a-secondary-flow-for-late-events-processing-be-defined-tp40933.html

Hi everybody!

I'm so excited to be here asking a first question about Flink DataStream API.

I have a basic enrichment pipeline (event time). Basically, there's a main stream A (Kafka source) being enriched with the info of 2 other streams: B and C. (Kafka sources as well).
Basically, the enrichment graph consists in 2 stages:

1. The stream A is enriched with stream B, resulting in stream (A, B)
2. The stream (A, B) is enriched with C, resulting in a stream (A, B, C).


I've created a side output for late events A. The flow for these events would be slightly different:

1. The stream A is enriched by fetching the info from an external service using Flin Async I/O, resulting in stream (A, B)
2. Then, the stream (A, B) is enriched with C, resulting in a stream (A, B, C) (same before)


Note that late A events can arrive out of order (weeks in same cases)


I was wondering where the late-events flow should be defined. One graph/job for both flows or 2 graph/jobs with different times? What's the general pattern for cases like this?  



Many thanks, Jose Velasco