Hi All,
The Flink documentation says of DELETE_ON_CANCELLATION: "Delete the checkpoint when the job is cancelled. The checkpoint state will only be available if the job fails." What is the definition of, and the difference between, a job being cancelled and a job failing? Suppose I run the program on YARN and, after a few days, the YARN application fails for some reason. If I use the DELETE_ON_CANCELLATION option, do I still have a checkpoint to resume the program from in that case? If checkpoints are only deleted when I cancel the program, I can always take a savepoint before cancelling. So it seems DELETE_ON_CANCELLATION is always sufficient, and I cannot find a case where RETAIN_ON_CANCELLATION should be used.
Best,
Henry
Hi All,
To clarify: assume I can guarantee that a savepoint is always taken before a manual cancellation. If I use the DELETE_ON_CANCELLATION option for checkpoints, is there any chance that I end up without a checkpoint to recover from? Thanks a lot.
Best,
Henry
|
Hi Henry,
To answer your questions:

What is the definition and difference between job cancel and job fails?
> Both cancellation and failure cause the job to enter a terminal state. Cancellation is triggered manually and the job terminates normally, while failure is usually a passive termination caused by an exception.

If I use DELETE_ON_CANCELLATION option, in this case, does I have the checkpoint to resume the program?
> No. If you use externalized checkpoints with this option, you cannot resume from them after the job has been cancelled.

I mean if I can guarantee that a savepoint can always be made before manually cancelation. If I use DELETE_ON_CANCELLATION option on checkpoints, is there any probability that I do not have a checkpoint to recover from?
> According to the latest source code, savepoints are not affected by CheckpointRetentionPolicy; they have to be cleaned up manually.

Thanks, vino.
徐涛 <[hidden email]> wrote on Tue, Sep 25, 2018 at 11:06 AM:
|
Hi Vino,
What is the definition of, and the difference between, a job being cancelled and a job failing? Can I say that if the program is shut down manually it is a job cancellation, and if it is shut down because of some error it is a job failure? This matters because it is the prerequisite for the following question. The Flink 1.6 documentation says:
"ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the checkpoint when the job is cancelled. Note that you have to manually clean up the checkpoint state after cancellation in this case.
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the checkpoint when the job is cancelled. The checkpoint state will only be available if the job fails."
But it does not say whether the checkpoint is retained when the job fails. If checkpoints are handled the same way on failure as on cancellation, then I have to use RETAIN_ON_CANCELLATION, because otherwise the checkpoint would be deleted when the job fails. If checkpoints are not deleted on failure, then DELETE_ON_CANCELLATION is safe in the failure case.
Best,
Henry
|
Hi Henry, I added comments (in blue) to your original email. Thanks, vino.
徐涛 <[hidden email]> wrote on Tue, Sep 25, 2018 at 12:56 PM:
That is not entirely true: a manually triggered cancel can also end in failure. Think of it this way: if a person triggers the cancel and every task instance is cancelled correctly, the job's final status is canceled. If the job ends because of any kind of anomaly, its final status is failed.
The configuration has two enumeration classes, `CheckpointRetentionPolicy` and `ExternalizedCheckpointCleanup`, so you need to consider which one you actually want to use. Your main concern is `ExternalizedCheckpointCleanup`, which controls cleanup of the metadata for externalized checkpoints. Are you sure you want to use it? By default, Flink manages checkpoint cleanup itself; that is, checkpoints are not externalized.
|
Hi Vino,
So I will use the default setting, DELETE_ON_CANCELLATION. When the program is cancelled, the checkpoint is deleted; when the program fails, the checkpoint is not deleted, so I still have a checkpoint I can use to resume. Please correct me if I am wrong. Thanks.
Best,
Henry
|
Hi Henry,
Your understanding is correct. A checkpoint exists for recovery purposes. If you cancel a job, Flink assumes there is no point in keeping the checkpoint any longer. If you want to recover after a cancel, you should use cancel with savepoint. So, by default, you do not need to clean up checkpoint metadata manually unless you plan to use externalized checkpoints.
Thanks, vino.
徐涛 <[hidden email]> wrote on Tue, Sep 25, 2018 at 2:59 PM:
|
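For readers skimming the thread, the behaviour agreed on above can be summarised as a small truth table. The following is only a toy model in plain Java, based on the documentation text quoted in the thread and on vino's confirmation; the class, enum, and method names are hypothetical and are not Flink classes. It covers only the two terminal states discussed here (cancelled and failed) and assumes externalized checkpoints are enabled.

```java
// Toy model only: none of these types are part of Flink.
// It encodes the rule discussed in the thread for externalized checkpoints.
class CheckpointRetentionModel {

    /** Named after the two documented cleanup options (hypothetical copy, not the Flink enum). */
    enum Cleanup { DELETE_ON_CANCELLATION, RETAIN_ON_CANCELLATION }

    /** The two terminal job states discussed in the thread. */
    enum TerminalState { CANCELED, FAILED }

    /** Returns true if the latest checkpoint is still available after the job ends. */
    static boolean checkpointRetained(Cleanup cleanup, TerminalState state) {
        if (state == TerminalState.FAILED) {
            // Both options keep the checkpoint when the job fails.
            return true;
        }
        // On cancellation, only RETAIN_ON_CANCELLATION keeps it.
        return cleanup == Cleanup.RETAIN_ON_CANCELLATION;
    }

    public static void main(String[] args) {
        // Print every combination of option and terminal state.
        for (Cleanup c : Cleanup.values()) {
            for (TerminalState s : TerminalState.values()) {
                System.out.println(c + " + " + s + " -> retained=" + checkpointRetained(c, s));
            }
        }
    }
}
```

The only combination that loses the checkpoint is DELETE_ON_CANCELLATION with a cancelled job, which is the case where vino recommends cancelling with a savepoint instead.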