I'm running a streaming job that uses the following config:
checkpointInterval = 5 mins
minPauseBetweenCheckpoints = 2 mins
checkpointTimeout = 1 minute
maxConcurrentCheckpoints = 1

This is using incremental, async checkpoints with the RocksDB backend. So far around 2K checkpoints have been triggered, but I just noticed that after the first ~1K the checkpoints have been failing with:

Checkpoint 1560 of job 9054d277265950c07ab90cf7ba0641d0 expired before completing.

Now I'm in a very interesting position: I want to trigger a `savepoint` or a `cancel -s`, but both of those commands are coupled to the checkpoint mechanism, so both fail precisely because the checkpoints are timing out.

Hence my question: is there a way to change the checkpoint configuration at runtime? It seems like there is no such thing, but I also see no good reason why it couldn't be implemented (we already allow modifying the parallelism of a job, which looks like a harder problem to solve).

Assuming there is no way to do this, how should I try to save my job? I do have the `RETAIN_ON_CANCELLATION` policy enabled.

Should I be able to resume the job from the last checkpoint using the --savepoint flag?

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
You cannot change the checkpointing configuration at runtime.
You should be able to resume the job from the last checkpoint.

On 22.01.2019 19:39, knur wrote:
> Should I be able to resume the job from the last checkpoint using the
> --savepoint flag?
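
Since the checkpoint settings are fixed once the job is submitted, one possible way out is to raise the timeout in the job code and resubmit from the last retained checkpoint. Below is a minimal sketch of what that configuration might look like with the DataStream API. The class name `CheckpointSettings`, the helper `configureCheckpointing`, and the 10-minute timeout are assumptions for illustration only; the other values are the ones from the original message. It needs the Flink streaming dependency on the classpath.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical helper class; the name is not from the original thread.
public class CheckpointSettings {

    // Assumed value: the thread does not say which timeout would be long enough.
    private static final long ASSUMED_TIMEOUT_MS = 10 * 60 * 1000L;

    static void configureCheckpointing(StreamExecutionEnvironment env) {
        // checkpointInterval = 5 mins
        env.enableCheckpointing(5 * 60 * 1000L);

        CheckpointConfig config = env.getCheckpointConfig();
        // minPauseBetweenCheckpoints = 2 mins
        config.setMinPauseBetweenCheckpoints(2 * 60 * 1000L);
        // checkpointTimeout was 1 minute; raised here to the assumed value
        config.setCheckpointTimeout(ASSUMED_TIMEOUT_MS);
        // maxConcurrentCheckpoints = 1
        config.setMaxConcurrentCheckpoints(1);
        // Keep the externalized checkpoint when the job is cancelled
        config.enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        configureCheckpointing(env);
        // Only prints the setting; a real job would define its pipeline
        // and call env.execute() here.
        System.out.println(env.getCheckpointConfig().getCheckpointTimeout());
    }
}
```

Note that the CLI option for restoring is `-s` (long form `--fromSavepoint`) on `flink run`, not `--savepoint`. For a retained checkpoint it is pointed at the checkpoint's metadata path; the exact path depends on the configured checkpoint directory, so check there for the last completed checkpoint.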