Hi,

We have a Flink job that was stopped erroneously with no available checkpoint/savepoint to restore from, and we are looking for some help to narrow down the problem.

How we ran into this problem:

We stopped the job using the cancel-with-savepoint command (because of a compatibility issue), but the command timed out after 1 min because there was some backpressure. So we force-killed the job with the yarn kill command. Usually this would not cause trouble, because we can still use the last checkpoint to restore the job. But this time, the last checkpoint dir had been cleaned up and was empty (the retained checkpoint number was 1).

According to ZooKeeper and the logs, the savepoint finished (the job master logged “Savepoint stored in …”) right after the cancel timeout. However, the savepoint directory contains only the _metadata file, and the other state files referenced by the metadata are absent.

Environment & Config:
- Flink 1.11.0
- YARN job cluster
- HA via ZooKeeper
- FsStateBackend
- Aligned non-incremental checkpoint

Any comments and suggestions are appreciated! Thanks!

Best,
Paul Lam
|
Hi Paul,

could you share the JobManager logs with us? They might help us better understand the order in which each operation occurred.

How large do you expect the state to be? If it is smaller than state.backend.fs.memory-threshold, then the state data will be stored in the _metadata file itself.

Cheers,
Till

On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <[hidden email]> wrote:
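To make the point about the _metadata file concrete, here is a minimal Python sketch that lists a savepoint directory and reports the size of _metadata. It assumes the directory is readable through the local filesystem (a savepoint on HDFS would need different tooling), and the function names are hypothetical helpers, not part of Flink. A directory that holds only a non-empty _metadata file is merely consistent with the state being stored inline; it does not prove it.

```python
import os
import tempfile


def summarize_savepoint_dir(path):
    """Return (sorted file names, size of _metadata in bytes or None)."""
    names = sorted(os.listdir(path))
    metadata_path = os.path.join(path, "_metadata")
    size = os.path.getsize(metadata_path) if os.path.isfile(metadata_path) else None
    return names, size


def state_probably_inlined(path):
    """Heuristic only: True if the directory holds nothing but a non-empty _metadata."""
    names, size = summarize_savepoint_dir(path)
    return names == ["_metadata"] and size is not None and size > 0


# Demo against a throwaway directory standing in for a real savepoint directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "_metadata"), "wb") as f:
        f.write(b"\x00" * 1024)  # placeholder bytes, not a real savepoint
    demo_names, demo_size = summarize_savepoint_dir(demo_dir)
    demo_inlined = state_probably_inlined(demo_dir)

print(demo_names, demo_size, demo_inlined)
```

The only reliable confirmation is still an actual restore attempt from the savepoint.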
|
Thanks for sharing the logs with me. It looks as if the total size of the savepoint is 335 KB for a job with a parallelism of 60 and a total of 120 tasks. Hence, the average state size per task is roughly between 2.5 KB and 5 KB. I think that the state size threshold refers to the size of the per-task state. Hence, I believe that the _metadata file should contain all of your state. Have you tried restoring from this savepoint?

Cheers,
Till

On Tue, Sep 29, 2020 at 3:47 PM Paul Lam <[hidden email]> wrote:
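The arithmetic behind that estimate can be sketched in a few lines of Python. The savepoint size, parallelism, and task count come from the message above; the 20 KB threshold is an assumption (check the value of state.backend.fs.memory-threshold in your own configuration), and the variable names are illustrative only.

```python
# Figures quoted in the thread.
TOTAL_SAVEPOINT_KB = 335
PARALLELISM = 60
TOTAL_TASKS = 120

# Assumption: the effective state.backend.fs.memory-threshold, in KB.
# Replace with the value from your own flink-conf.yaml.
ASSUMED_THRESHOLD_KB = 20

# Spread the state over all 120 tasks (lower bound) or over only
# 60 stateful tasks (upper bound).
per_task_low_kb = TOTAL_SAVEPOINT_KB / TOTAL_TASKS
per_task_high_kb = TOTAL_SAVEPOINT_KB / PARALLELISM

# If even the upper bound is below the threshold, the per-task state
# would be written into _metadata rather than into separate files.
fits_inline = per_task_high_kb < ASSUMED_THRESHOLD_KB

print(f"{per_task_low_kb:.1f} KB to {per_task_high_kb:.1f} KB per task, "
      f"inline: {fits_inline}")
```

This gives about 2.8 KB to 5.6 KB per task, in line with the rough estimate above and well under the assumed threshold. These are averages, so a single task with skewed state could still exceed the threshold.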
|
Hi Till,

Thanks a lot for the pointer! I tried to restore the job from the savepoint in a dry run, and it worked!

I guess I misunderstood the configuration option and was confused by the non-existent paths that the metadata contains.

Best,
Paul Lam

On Tue, Sep 29, 2020 at 10:30 PM Till Rohrmann <[hidden email]> wrote:
|
Glad to hear that your job data was not lost!

Cheers,
Till

On Tue, Sep 29, 2020 at 7:28 PM Paul Lam <[hidden email]> wrote: