We plan to use Flink for streaming analytics, sending its results to a time-series database (TSDB).
If we make an error in code or deployment, incorrect data will flow into the TSDB (quite probably with scattered timestamps). We want to be able to recover from that kind of problem by restoring all data to a known-good state, and then re-streaming from there.
Flink snapshots make it easy to restore Flink's own past state - but how can we restore the TSDB to its equivalent past state? We can see two approaches - does anyone in the Flink community have advice on best practice for this?
1) We could enrich every datapoint we send from Flink to the TSDB with a snapshot label. When we wind back, we selectively delete all data carrying a newer snapshot label. A simpler variant: add an "arrival time" metric to each datapoint we put in the TSDB, and on wind-back delete all data with arrival_time > snapshot_time. Either way, this is likely to stress the TSDB, because it is a large selective delete keyed on something other than the TSDB's primary timestamp.
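To make approach 1 concrete, here is a minimal sketch of the arrival-time variant. The TSDB is stood in for by a plain in-memory list, and all names (Point, FakeTsdb, rollback_to) are illustrative, not any real TSDB's API:

```python
from dataclasses import dataclass

@dataclass
class Point:
    series_ts: int   # the TSDB's primary timestamp key (may be scattered)
    arrival_ts: int  # extra metric: when Flink wrote the point
    value: float

class FakeTsdb:
    """In-memory stand-in for the TSDB, for illustration only."""
    def __init__(self):
        self.points = []

    def write(self, point: Point):
        self.points.append(point)

    def rollback_to(self, snapshot_ts: int):
        # The selective delete: drop everything that *arrived* after the
        # snapshot, regardless of its series timestamp. In a real TSDB this
        # filters on a non-primary column, which is where the stress comes from.
        self.points = [p for p in self.points if p.arrival_ts <= snapshot_ts]

tsdb = FakeTsdb()
tsdb.write(Point(series_ts=100, arrival_ts=10, value=1.0))
tsdb.write(Point(series_ts=50,  arrival_ts=20, value=2.0))  # scattered series timestamp
tsdb.write(Point(series_ts=200, arrival_ts=30, value=3.0))  # written after the bad deploy

tsdb.rollback_to(snapshot_ts=25)  # wind back to the snapshot taken at arrival time 25
print([p.series_ts for p in tsdb.points])  # -> [100, 50]; the bad point is gone
```

Note that the point with series_ts=50 survives even though its primary timestamp is old - the delete has to be keyed on arrival time, which is exactly why it cannot use the TSDB's time-ordered storage.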
2) We could take snapshot backups of the TSDB (assuming it supports them) that are synchronized with Flink snapshots. How can we ensure they stay synchronized? Perhaps as Flink's "we're doing a snapshot" token (the checkpoint barrier) propagates out of the sink end of the job, we can use it to trigger a TSDB snapshot of the same name.
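For approach 2, Flink does expose a real hook for this: a sink can implement CheckpointListener, whose notifyCheckpointComplete(checkpointId) is called once a checkpoint has completed. The sketch below models that coordination in plain Python (FakeTsdb, SnapshotAwareSink, and the "ckpt-N" naming scheme are all assumptions for illustration, not Flink or TSDB APIs):

```python
class FakeTsdb:
    """In-memory stand-in for a TSDB that supports named snapshots."""
    def __init__(self):
        self.data = []
        self.snapshots = {}

    def write(self, value):
        self.data.append(value)

    def snapshot(self, name: str):
        self.snapshots[name] = list(self.data)  # copy-on-snapshot

    def restore(self, name: str):
        self.data = list(self.snapshots[name])

class SnapshotAwareSink:
    """Models a Flink sink that reacts to checkpoint-complete notifications."""
    def __init__(self, tsdb: FakeTsdb):
        self.tsdb = tsdb

    def invoke(self, value):
        self.tsdb.write(value)

    def notify_checkpoint_complete(self, checkpoint_id: int):
        # Name the TSDB snapshot after the Flink checkpoint, so restoring
        # Flink checkpoint N always pairs with TSDB snapshot "ckpt-N".
        self.tsdb.snapshot(f"ckpt-{checkpoint_id}")

tsdb = FakeTsdb()
sink = SnapshotAwareSink(tsdb)
sink.invoke(1)
sink.invoke(2)
sink.notify_checkpoint_complete(7)  # checkpoint 7 reaches the sink
sink.invoke(99)                     # bad data written after the known-good point
tsdb.restore("ckpt-7")              # wind back alongside Flink checkpoint 7
print(tsdb.data)  # -> [1, 2]
```

One caveat: notifyCheckpointComplete fires after the checkpoint completes, by which time records from after the barrier may already have been written to the TSDB, so the paired snapshots are only approximately aligned. Getting them exactly aligned would presumably need a transactional/two-phase-commit style sink rather than this simple notification.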
All advice gratefully received!
Thanks,