Change Flink checkpoint configuration at runtime

knur
I'm running a streaming job that uses the following config:

    checkpointInterval = 5 mins
    minPauseBetweenCheckpoints = 2 mins
    checkpointTimeout = 1 minute
    maxConcurrentCheckpoints = 1

This uses incremental, asynchronous checkpoints with the RocksDB backend. So
far around 2K checkpoints have been triggered, but I just noticed that after
the first ~1K, every checkpoint has been failing with:

    Checkpoint 1560 of job 9054d277265950c07ab90cf7ba0641d0 expired before completing.

Now I'm in an awkward position: I want to trigger a `savepoint` or a
`cancel -s`, but both commands are coupled to the checkpoint mechanism, so
they fail for the same reason the checkpoints do: they time out.

Hence my question: is there a way to change the checkpoint configuration at
runtime? It seems there is no such thing, but I also see no good reason why it
couldn't be implemented (we already allow modifying the parallelism of a job,
which looks like a harder problem to solve).

Assuming there is no way to do this, how should I try to save my job? I do
have the `RETAIN_ON_CANCELLATION` policy enabled.

Should I be able to resume the job from the last checkpoint using the
--savepoint flag?



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: Change Flink checkpoint configuration at runtime

Chesnay Schepler
You cannot change the checkpointing configuration at runtime.

You should be able to resume the job from the last checkpoint.
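
For illustration, here is a rough sketch of how resuming from a retained checkpoint might look. It assumes the checkpoints are on a local or mounted filesystem and follow the usual `<state.checkpoints.dir>/<job-id>/chk-<n>/_metadata` layout; the `latest_checkpoint` helper, the `/flink/checkpoints` prefix and `my-job.jar` are hypothetical names, not part of Flink. Since the checkpoint settings are fixed at submission time, a longer timeout would have to be set in the job itself (for example via `CheckpointConfig#setCheckpointTimeout`) before resubmitting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest completed checkpoint directory
# (highest-numbered chk-<n> containing a _metadata file) under a job's
# checkpoint directory. Returns non-zero if none is found.
latest_checkpoint() {
  local job_dir="$1" best="" best_n=-1 dir n
  for dir in "$job_dir"/chk-*/; do
    [ -f "${dir}_metadata" ] || continue          # skip incomplete checkpoints
    n="${dir%/}"; n="${n##*chk-}"                 # extract the checkpoint number
    case "$n" in ''|*[!0-9]*) continue ;; esac    # ignore non-numeric suffixes
    if [ "$n" -gt "$best_n" ]; then best_n="$n"; best="${dir%/}"; fi
  done
  [ -n "$best" ] || return 1
  printf '%s\n' "$best"
}

# Example usage (paths are assumptions; substitute your own):
# CHECKPOINT_DIR=/flink/checkpoints/9054d277265950c07ab90cf7ba0641d0
# flink run -s "$(latest_checkpoint "$CHECKPOINT_DIR")" my-job.jar
```

The comparison is numeric rather than lexicographic, so `chk-10` sorts after `chk-9`. For checkpoints on HDFS or S3 the listing step would need the corresponding filesystem tooling instead of a shell glob.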

On 22.01.2019 19:39, knur wrote:

> [...]
> Hence my question... is there a way to change the configuration of the
> checkpoints at runtime?
> [...]
> Should I be able to resume the job from the last checkpoint using the
> --savepoint flag?