Re: Savepoint incomplete when job was killed after a cancel timeout

Posted by Till Rohrmann on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Savepoint-incomplete-when-job-was-killed-after-a-cancel-timeout-tp38392p38393.html

Hi Paul,

could you share the JobManager logs with us? They might help us better understand the order in which the operations occurred.

How large do you expect the state to be? If it is smaller than state.backend.fs.memory-threshold, the state data will be stored inline in the _metadata file instead of in separate state files.
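
For illustration, here is a minimal sketch of how this threshold can also be set programmatically on the FsStateBackend (the HDFS path is a placeholder, and the 1024-byte value mirrors the default of state.backend.fs.memory-threshold in 1.11):

import java.net.URI;

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FsStateBackendThresholdSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // State chunks smaller than this threshold (in bytes) are inlined into the
        // checkpoint/savepoint _metadata file instead of being written as separate
        // state files. 1024 bytes mirrors the 1.11 default of
        // state.backend.fs.memory-threshold; adjust it to your configuration.
        int fileStateSizeThreshold = 1024;

        // Placeholder checkpoint URI; replace it with the actual cluster path.
        FsStateBackend backend =
                new FsStateBackend(new URI("hdfs:///flink/checkpoints"), fileStateSizeThreshold);

        env.setStateBackend(backend);
        // ... job definition and env.execute(...) would follow here
    }
}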

Cheers,
Till

On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <[hidden email]> wrote:
Hi,

We have a Flink job that was stopped erroneously, leaving no usable checkpoint/savepoint to restore from,
and we are looking for some help to narrow down the problem.

How we ran into this problem:

We stopped the job with the cancel-with-savepoint command (for compatibility reasons), but the command
timed out after 1 minute because of backpressure. So we force-killed the job with the YARN kill command.
Usually this would not cause trouble, because we could still restore the job from the last checkpoint.

But this time, the last checkpoint directory had been cleaned up and was empty (the number of retained
checkpoints was 1). According to ZooKeeper and the logs, the savepoint finished (the job master logged
“Savepoint stored in …”) right after the cancel timeout. However, the savepoint directory contains only
the _metadata file, and the other state files referenced by the metadata are absent.

Environment & Config (see the sketch after this list):
- Flink 1.11.0
- YARN job cluster
- HA via ZooKeeper
- FsStateBackend
- Aligned non-incremental checkpoint
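
For context, here is a rough API-level sketch of this setup (the checkpoint interval and the externalized-checkpoint cleanup mode are illustrative assumptions, not our exact configuration):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Aligned, non-incremental checkpoints; the 60 s interval is an assumption.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // Keep checkpoints aligned (unaligned checkpoints are disabled by default in 1.11).
        checkpointConfig.enableUnalignedCheckpoints(false);

        // Assumed cleanup mode: retain the externalized checkpoint when the job is
        // cancelled, so it can still be used for a restore if cancel-with-savepoint fails.
        checkpointConfig.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // The number of retained checkpoints (1 in our case) is controlled by
        // state.checkpoints.num-retained in flink-conf.yaml, not by this API.
    }
}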

Any comments and suggestions are appreciated! Thanks!

Best,
Paul Lam