StreamingFileSink duplicate data

Lei Nie
Hello,
I would like clarification on the StreamingFileSink, thank you.

From my testing, it seems that resuming a job from a checkpoint does not restore the rolling part counter.

For example, the job may have stopped with this as its last file:
part-6-71

But after resuming from the most recent checkpoint, the next file is:
part-6-89
(There is an unexplained gap.)

This is a problem if my job has an issue and I need to roll back more than one checkpoint. After rolling back to, e.g., the 4th most recent checkpoint, the data is written again under different part file names, causing duplication.
-----------------------------------------------------------------
For example, checkpoints:
chk-17, chk-18, chk-19, chk-20

Original data:
part-1-5, part-1-6, part-1-7

Rolling back to chk-17 writes part-1-18, but with the same data as part-1-5! This is a duplicate.
------------------------------------------------------------------
Am I correct? How can I avoid this?

Re: StreamingFileSink duplicate data

Paul Lam
Hi,

StreamingFileSink does not remove committed files, so if you restore state from a checkpoint other than the latest one, you may need to clean up manually.
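
As a rough illustration of what such a manual cleanup could look like, here is a dry-run sketch in Python. It only works on file names and deletes nothing. The helper stale_parts and both boundary arguments are hypothetical: you would have to determine yourself (for instance from a directory listing or file timestamps) which part counter per subtask was the first one written after the checkpoint you restore from, and which counter the restored job starts at. It also assumes the part-<subtask>-<counter> naming shown in this thread.

```python
import re

# Matches the part file naming seen in this thread: part-<subtask>-<counter>
PART_NAME = re.compile(r"^part-(\d+)-(\d+)$")


def stale_parts(file_names, first_stale_by_subtask, first_new_counter):
    """List part files from the old run that the restored job will rewrite.

    file_names: names found in one bucket directory.
    first_stale_by_subtask: hypothetical mapping of subtask index to the first
        part counter written after the checkpoint being restored.
    first_new_counter: hypothetical first part counter used by the restored job.
    """
    stale = []
    for name in file_names:
        match = PART_NAME.match(name)
        if match is None:
            continue  # not a part file, leave it alone
        subtask, counter = int(match.group(1)), int(match.group(2))
        first_stale = first_stale_by_subtask.get(subtask)
        # Keep files written before the checkpoint and files of the new run.
        if first_stale is not None and first_stale <= counter < first_new_counter:
            stale.append(name)
    return stale


# Numbers taken from the example in the question (rollback to chk-17).
files = ["part-1-4", "part-1-5", "part-1-6", "part-1-7", "part-1-18"]
print(stale_parts(files, {1: 5}, 18))  # ['part-1-5', 'part-1-6', 'part-1-7']
```

Reviewing the returned list before deleting anything is probably the safer workflow, since the boundaries are supplied by hand.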

Regarding the part id issue, StreamingFileSink tracks the global max part number and uses this value + 1 as the new id upon restoring. This avoids file name conflicts with the previous execution (see [1]).
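
To make the "global max + 1" rule concrete, here is a small Python model of it. This is not Flink code, only a sketch of the behavior described above. The function names and the shape of the checkpointed counters are made up for illustration, and the value 88 for subtask 5 is an assumed number chosen to show one way the gap from part-6-71 to part-6-89 could arise.

```python
def restored_part_counter(checkpointed_counters):
    """Return the part id used after a restore: global max part number + 1.

    checkpointed_counters: part counters of all subtasks as stored in the
        checkpoint (hypothetical shape, for illustration only).
    """
    return max(checkpointed_counters) + 1


def part_file_name(subtask_index, part_counter):
    """Build a name in the part-<subtask>-<counter> form seen in this thread."""
    return f"part-{subtask_index}-{part_counter}"


# Assumed state: subtask 6 had reached 71, while subtask 5 had reached 88.
counters_in_checkpoint = {5: 88, 6: 71}
next_id = restored_part_counter(counters_in_checkpoint.values())
print(part_file_name(6, next_id))  # part-6-89
```

Under this model every subtask jumps to the same new id, so a subtask that was behind the others sees a gap in its own numbering, but no restored subtask can reuse a name from the previous execution.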


Best,
Paul Lam

On Nov 21, 2019, at 10:01, Lei Nie <[hidden email]> wrote:
