TaskManager failure detection

Dominik Safaric
Hi,

As I’m investigating Flink’s fault tolerance capabilities, I would like to know which component and class are in charge of TaskManager failure detection and checkpoint restoration. In addition, how does Flink actually determine that a TaskManager has failed, e.g. due to a hardware failure?

As far as I know, the state should be restored by the CheckpointCoordinator or the ExecutionGraph. Correct me if I’m wrong.

Thanks in advance,
Dominik