Ambiguous behavior of Flink on Job cancellation with checkpoint configured

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Ambiguous behavior of Flink on Job cancellation with checkpoint configured

Parth Sarathy
Hi All,
               We are using flink 1.7.2 and have enabled checkpoint with
RocksDB configured as state backend with retain checkpoints on job cancel.
In our scenario we are cancelling the job and while resubmitting the job, we
try to restore the job with latest checkpoint / savepoint available. We are
observing ambiguous behavior based on the way job is being cancelled, below
are the captured observations:

Observations :
1. When we cancel the job with a savepoint option, a savepoint is created as
expected but flink is deleting the latest checkpoint directory available for
the running job. Is this an expected behavior even when the configuration
asks to retain checkpoints on job cancellation?
2. When we cancel the job without the savepoint option, the same latest
checkpoint was retained by flink as opposed to before where it was deleted
as job was cancelled with the savepoint option.

               As we have configured flink to retain only a single
checkpoint at any point of time, could there be any issue wherein when we
cancel the job with a savepoint, the savepoint gets triggered but fails
midway. So now we would end up with an incomplete savepoint and no trace of
checkpoint for the job as it would have been erased.

Thanks
Parth Sarathy



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: Ambiguous behavior of Flink on Job cancellation with checkpoint configured

Aljoscha Krettek
Hi,

That’s a good observation! And it is indeed the expected behaviour. There are two parts to understanding this:
 * "retain checkpoints” tells Flink to retain any checkpoints that it stores when a job is shut down
 * for recovery purposes (all checkpointing purposes, really) a savepoint counts as a checkpoint. Otherwise, you can have strange behaviour when you do a successful savepoint, then a failure occurs, and you would then restore from the checkpoint before that. See especially [1] about this.

So when you configure Flink to retain only one checkpoint this means that when you do a savepoint it will count as that one checkpoint, i.e. the previous checkpoint is discarded. Thus, you only have that savepoint after the job is canceled.

I hope this makes sense. Let me know if you have any questions.

Best,
Aljoscha

[1] https://issues.apache.org/jira/browse/FLINK-10354

> On 21. Mar 2019, at 08:58, Parth Sarathy <[hidden email]> wrote:
>
> Hi All,
>               We are using flink 1.7.2 and have enabled checkpoint with
> RocksDB configured as state backend with retain checkpoints on job cancel.
> In our scenario we are cancelling the job and while resubmitting the job, we
> try to restore the job with latest checkpoint / savepoint available. We are
> observing ambiguous behavior based on the way job is being cancelled, below
> are the captured observations:
>
> Observations :
> 1. When we cancel the job with a savepoint option, a savepoint is created as
> expected but flink is deleting the latest checkpoint directory available for
> the running job. Is this an expected behavior even when the configuration
> asks to retain checkpoints on job cancellation?
> 2. When we cancel the job without the savepoint option, the same latest
> checkpoint was retained by flink as opposed to before where it was deleted
> as job was cancelled with the savepoint option.
>
>               As we have configured flink to retain only a single
> checkpoint at any point of time, could there be any issue wherein when we
> cancel the job with a savepoint, the savepoint gets triggered but fails
> midway. So now we would end up with an incomplete savepoint and no trace of
> checkpoint for the job as it would have been erased.
>
> Thanks
> Parth Sarathy
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply | Threaded
Open this post in threaded view
|

Re: Ambiguous behavior of Flink on Job cancellation with checkpoint configured

Parth Sarathy
That makes sense. Thanks for the clarification.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/