(DEPRECATED) Apache Flink User Mailing List archive.

Failure detection and Heartbeats

Posted by Geldenhuys, Morgan Karl on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Failure-detection-and-Heartbeats-tp33544.html

Hi community,

I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic.

Looking at the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), there are these two configuration options:

heartbeat.interval	10000	Long	Time interval for requesting heartbeat from sender side.
heartbeat.timeout	50000	Long	Timeout for requesting and receiving heartbeat for both sender and receiver sides.

This would indicate Flink uses a heartbeat mechanism to ascertain the liveness of TaskManagers. From this the following assumptions are made:

The JobManager is responsible for broadcasting a heartbeat requests to all TaskManagers and awaits responses.
If a response is not forthcoming from any particular node within the heartbeat timeout period, e.g. 50 seconds by default, then that node is timed out and assumed to have failed.
The heartbeat interval indicated how often the heartbeat request broadcast is scheduled.
Having the heartbeat interval shorter than the heartbeat timeout would mean that multiple requests can be underway at the same time.
Therefore, the TaskManager would need to fail to respond to 4 requests (assuming normal response times are lower than 10 seconds) before being timed out after 50 seconds.

So therefore if a failure were to occur (considering the default settings):
- In the best case the JobManager would detect the failure in the shortest time, i.e. 50 seconds +- (node fails just before receiving the next heartbeat request)
- In the worst case the JobManager would detect the failure in the longest time, i.e. 60 seconds +- (node fails just after sending the last heartbeat response)

Is this correct?

For JobManagers in HA mode, this is left to ZooKeeper timeouts which then initiates a round of elections and the new leader picks up from the previous checkpoint.

Thank you in advance.

Regards,
M.

Failure detection and Heartbeats

heartbeat.interval

heartbeat.timeout