(DEPRECATED) Apache Flink User Mailing List archive.

Flink checkpointing behavior

amran dean

Flink checkpointing behavior

Hello,

The exact semantics of checkpointing and task recovery are still a little confusing to me after reading the docs, so I have a few questions.

- What does Flink consider a task failure? Is it any exception that the job does not handle?

- Do the failure recovery strategies mentioned in https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html refer to restarting from the most recent checkpoint?

E.g., for the fixed-delay strategy, is a fixed number of restarts attempted from a specific checkpoint?

- The docs mention the following command to resume from a checkpoint. In the checkpoint metadata path I have configured, I only see a series of directories named by hashes:

- 24c8d7a38dd90ca8bd5f04c36d1442ba
  - shared
  - taskowned
- 5d202a0ba04cdc1b917892c1e35d00dc
  - shared
  - taskowned

How do I know which is the most recent checkpoint?

Really appreciate any help. Thank you.

vino yang

Re: Flink checkpointing behavior

Hi Amran,

See my inline answers.

Best,

Vino

On Wed, Oct 30, 2019 at 2:59 AM, amran dean <[hidden email]> wrote:

Hello,
The exact semantics of checkpointing and task recovery are still a little confusing to me after reading the docs, so I have a few questions.

- What does Flink consider a task failure? Is it any exception that the job does not handle?

Flink treats as a task failure anything that makes the task itself unable to continue running.

- Do the failure recovery strategies mentioned in https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html refer to restarting from the most recent checkpoint?
E.g., for the fixed-delay strategy, is a fixed number of restarts attempted from a specific checkpoint?

On an automatic restart, Flink tries to restore from the most recent completed checkpoint.
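
To make the fixed-delay case concrete, a restart strategy could be set in flink-conf.yaml roughly as follows. This is only a sketch: the attempt count and delay are illustrative values I picked, not recommendations, and the same strategy can also be set per job in code.

```yaml
# Restart the job a bounded number of times, waiting between attempts.
restart-strategy: fixed-delay
# Illustrative values (assumptions): give up after 3 failed restart attempts.
restart-strategy.fixed-delay.attempts: 3
# Wait 10 seconds between consecutive restart attempts.
restart-strategy.fixed-delay.delay: 10 s
```

With checkpointing enabled, each of those restart attempts restores from the latest completed checkpoint rather than from one fixed checkpoint.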

- The docs mention the following command to resume from a checkpoint. In the checkpoint metadata path I have configured, I only see a series of directories named by hashes:

- 24c8d7a38dd90ca8bd5f04c36d1442ba
  - shared
  - taskowned
- 5d202a0ba04cdc1b917892c1e35d00dc
  - shared
  - taskowned
How do I know which is the most recent checkpoint?

In the checkpoint directory for your job ID, you should see folders with names like "chk-xxx"; pass the path of the one you want to resume from. For more details, see [1].

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/state/checkpoints.html#resuming-from-a-retained-checkpoint
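
If it helps, picking the newest "chk-xxx" folder can be scripted. Below is a minimal sketch in plain Java (no Flink dependency), assuming the checkpoint directory is on a local or mounted filesystem and that the folders are named "chk-" followed by the numeric checkpoint ID. The class name LatestCheckpoint and the method findLatest are hypothetical names made up for this example. It compares the IDs numerically, so chk-10 ranks above chk-9.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, not part of Flink.
class LatestCheckpoint {
    // Assumed naming scheme: "chk-" followed by the numeric checkpoint ID.
    private static final Pattern CHK_DIR = Pattern.compile("chk-(\\d+)");

    /** Returns the chk-N subdirectory of jobDir with the largest N, if any. */
    static Optional<Path> findLatest(Path jobDir) throws IOException {
        try (Stream<Path> children = Files.list(jobDir)) {
            return children
                .filter(Files::isDirectory)
                // Skips "shared", "taskowned", and anything else not named chk-N.
                .filter(p -> CHK_DIR.matcher(p.getFileName().toString()).matches())
                .max(Comparator.comparingLong(LatestCheckpoint::checkpointId));
        }
    }

    // Extracts N from a directory named chk-N.
    private static long checkpointId(Path chkDir) {
        Matcher m = CHK_DIR.matcher(chkDir.getFileName().toString());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a chk-N directory: " + chkDir);
        }
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LatestCheckpoint <checkpoint-dir>/<job-id>");
            return;
        }
        Path jobDir = Paths.get(args[0]);
        Optional<Path> latest = findLatest(jobDir);
        System.out.println(latest.map(Path::toString)
            .orElse("no chk-* directory found in " + jobDir));
    }
}
```

The printed path is what you would then hand to the resume command described in [1], which has the form `bin/flink run -s :checkpointMetaDataPath [:runArgs]`. One caveat: the sketch does not check that the checkpoint actually completed, so it is worth confirming that the chosen folder contains a `_metadata` file before resuming from it.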

Really appreciate any help. Thank you.