Hello,
I would like some clarification on the StreamingFileSink.
From my testing, it seems that resuming a job from a checkpoint does not restore the rolling part counter.
For example, the job may have stopped with this as its last file:
part-6-71
But after resuming from the most recent checkpoint, the next file written is:
part-6-89
(There is an unexplained gap in the counter.)
This becomes a problem if something goes wrong with my job and I need to roll back more than one checkpoint. If I roll back to the 4th most recent checkpoint, for example, the replayed data is written under different part file names than before, so the same data ends up in two sets of files.
-----------------------------------------------------------------
For example, checkpoints:
chk-17, chk-18, chk-19, chk-20
Part files written originally:
part-1-5, part-1-6, part-1-7
After rolling back to chk-17, the job writes part-1-18, which contains the same data as part-1-5. That data is now duplicated.
------------------------------------------------------------------
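To make the scenario concrete, here is a small toy model of what I think is happening. This is not Flink code: the class PartCounterSketch, its methods, and the records "a", "b", "c" are made up for illustration, and the assumption that the counter starts at 18 after the rollback is only what I observed, not something I have confirmed in the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only, not the StreamingFileSink implementation.
class PartCounterSketch {

    // Writes one part file per record, starting at the given counter value.
    // Returns the counter value after the last write.
    static int writeParts(Map<String, String> files, int bucket, int counter,
                          List<String> records) {
        for (String record : records) {
            files.put("part-" + bucket + "-" + counter, record);
            counter++;
        }
        return counter;
    }

    // Returns the records that are stored under more than one file name.
    static List<String> duplicatedRecords(Map<String, String> files) {
        Map<String, Integer> seen = new LinkedHashMap<>();
        for (String content : files.values()) {
            seen.merge(content, 1, Integer::sum);
        }
        List<String> duplicates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : seen.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        List<String> recordsAfterChk17 = Arrays.asList("a", "b", "c");

        // First run: records processed after chk-17 go to part-1-5 .. part-1-7.
        writeParts(files, 1, 5, recordsAfterChk17);

        // Rollback to chk-17: the same records are replayed, but the counter
        // starts at 18 instead of 5 (assumed, based on what I observed).
        writeParts(files, 1, 18, recordsAfterChk17);

        System.out.println(files.keySet());
        System.out.println("duplicated: " + duplicatedRecords(files));
    }
}
```

If the counter were restored to its value at chk-17, the second run would overwrite part-1-5 through part-1-7 in this model instead of creating new files, and nothing would be duplicated.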
Am I correct? If so, how can I avoid this?