Hi Ananth,
> if it detects significant backlog, skip over and consume from the latest offset, and schedule a separate backfill task for the backlogged 20 minutes?
Fabian is right; there are no built-in operators for this.
If you don't care about watermarks, I think you can implement it with a custom source that can either sleep or consume data within a given time range.
The job looks like:
Source1(s1) --\
               Union -> Sink
Source2(s2) --/
The job works as follows:
- t1: s1 is working, s2 sleeps
- t2: an outage hits the Elasticsearch cluster
- t3: ES is available again. s1 resumes from t1 and stops at t3; s2 starts directly from t3
- t4: s1 sleeps, s2 is working
To achieve this, we also need a way to exchange progress between the two sources, e.g., by syncing source status through an HBase or MySQL table.
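As a minimal sketch of that handoff logic (outside of Flink, with illustrative names I made up: `ProgressStore` stands in for the HBase/MySQL table, and `BackfillSource` for the custom source s1), the live source records the offset it started from, and the backfill source consumes only up to that point before going back to sleep:

```java
import java.util.concurrent.atomic.AtomicLong;

// Stand-in for the shared HBase/MySQL table both sources consult.
// Holds the handoff timestamp: where s2 (the live source) started,
// and therefore where s1 (the backfill source) must stop.
class ProgressStore {
    private final AtomicLong handoffTs = new AtomicLong(Long.MIN_VALUE);

    void recordHandoff(long ts) { handoffTs.set(ts); }  // written by s2 at t3
    long handoff() { return handoffTs.get(); }          // read by s1
}

// Simulated backfill source s1: consumes records in [startTs, handoff),
// then "sleeps" (returns false) once it has caught up.
class BackfillSource {
    private final ProgressStore store;
    private long position;

    BackfillSource(ProgressStore store, long startTs) {
        this.store = store;
        this.position = startTs;
    }

    // Returns true while there is still backlog to consume.
    boolean consumeNext() {
        if (position >= store.handoff()) return false; // caught up: sleep
        position++; // stand-in for reading one record from the backlog
        return true;
    }

    long position() { return position; }
}
```

For example, if the outage started at t1 = 100 and s2 comes up at t3 = 103 and calls `recordHandoff(103)`, then `new BackfillSource(store, 100)` consumes exactly the three backlogged units and stops at 103. The same pattern works with Kafka offsets instead of timestamps.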
Best, Hequn