Posted by
Geldenhuys, Morgan Karl on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Failure-detection-and-Heartbeats-tp33544.html
Hi community,
I am interested in knowing more about the failure detection
mechanism used by Flink, unfortunately information is a little thin
on the ground and I was hoping someone could shed a little light on
the topic.
Looking at the documentation (
https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html),
there are these two configuration options:
heartbeat.interval
|
10000 |
Long |
Time interval for requesting heartbeat from
sender side. |
heartbeat.timeout
|
50000 |
Long |
Timeout for requesting and receiving heartbeat
for both sender and receiver sides. |
This would indicate Flink uses a heartbeat mechanism to ascertain
the liveness of TaskManagers. From this the following assumptions
are made:
The JobManager is responsible for broadcasting a heartbeat requests
to all TaskManagers and awaits responses.
If a response is not forthcoming from any particular node within the
heartbeat timeout period, e.g. 50 seconds by default, then that node
is timed out and assumed to have failed.
The heartbeat interval indicated how often the heartbeat request
broadcast is scheduled.
Having the heartbeat interval shorter than the heartbeat timeout
would mean that multiple requests can be underway at the same time.
Therefore, the TaskManager would need to fail to respond to 4
requests (assuming normal response times are lower than 10 seconds)
before being timed out after 50 seconds.
So therefore if a failure were to occur (considering the default
settings):
- In the best case the JobManager would detect the failure in the
shortest time, i.e. 50 seconds +- (node fails just before receiving
the next heartbeat request)
- In the worst case the JobManager would detect the failure in the
longest time, i.e. 60 seconds +- (node fails just after sending the
last heartbeat response)
Is this correct?
For JobManagers in HA mode, this is left to ZooKeeper timeouts which
then initiates a round of elections and the new leader picks up from
the previous checkpoint.
Thank you in advance.
Regards,
M.