Where should a secondary flow for late events processing be defined?
Posted by
Jose Velasco on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Where-should-a-secondary-flow-for-late-events-processing-be-defined-tp40933.html
Hi everybody!
I'm so excited to be here asking a first question about Flink DataStream API.
I have a basic enrichment pipeline (event time). Basically, there's a main stream A (Kafka source) being enriched with the info of 2 other streams: B and C. (Kafka sources as well).
Basically, the enrichment graph consists in 2 stages:
1. The stream A is enriched with stream B, resulting in stream (A, B)
2. The stream (A, B) is enriched with C, resulting in a stream (A, B, C).
I've created a side output for late events A. The flow for these events would be slightly different:
1. The stream A is enriched by fetching the info from an external service using Flin Async I/O, resulting in stream (A, B)
2. Then, the stream (A, B) is enriched with C, resulting in a stream (A, B, C) (same before)
Note that late A events can arrive out of order (weeks in same cases)
I was wondering where the late-events flow should be defined. One graph/job for both flows or 2 graph/jobs with different times? What's the general pattern for cases like this?
Many thanks, Jose Velasco