(DEPRECATED) Apache Flink User Mailing List archive.

Failure detection and Heartbeats

Classic

List

Threaded

2 messages Options

Geldenhuys, Morgan Karl

Failure detection and Heartbeats

Hi community,

I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic.

Looking at the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), there are these two configuration options:

heartbeat.interval	10000	Long	Time interval for requesting heartbeat from sender side.
heartbeat.timeout	50000	Long	Timeout for requesting and receiving heartbeat for both sender and receiver sides.

This would indicate Flink uses a heartbeat mechanism to ascertain the liveness of TaskManagers. From this the following assumptions are made:

The JobManager is responsible for broadcasting a heartbeat requests to all TaskManagers and awaits responses.
If a response is not forthcoming from any particular node within the heartbeat timeout period, e.g. 50 seconds by default, then that node is timed out and assumed to have failed.
The heartbeat interval indicated how often the heartbeat request broadcast is scheduled.
Having the heartbeat interval shorter than the heartbeat timeout would mean that multiple requests can be underway at the same time.
Therefore, the TaskManager would need to fail to respond to 4 requests (assuming normal response times are lower than 10 seconds) before being timed out after 50 seconds.

So therefore if a failure were to occur (considering the default settings):
- In the best case the JobManager would detect the failure in the shortest time, i.e. 50 seconds +- (node fails just before receiving the next heartbeat request)
- In the worst case the JobManager would detect the failure in the longest time, i.e. 60 seconds +- (node fails just after sending the last heartbeat response)

Is this correct?

For JobManagers in HA mode, this is left to ZooKeeper timeouts which then initiates a round of elections and the new leader picks up from the previous checkpoint.

Thank you in advance.

Regards,
M.

Gary Yao-5

Re: Failure detection and Heartbeats

Hi Morgan,

> I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic.
It is probably best to look into the implementation (see my answers below).

> Having the heartbeat interval shorter than the heartbeat timeout would mean that multiple requests can be underway at the same time.
Yes, in fact the heartbeat interval must be shorter than the timeout or else an exception is thrown [1]

> - In the worst case the JobManager would detect the failure in the longest time, i.e. 60 seconds +- (node fails just after sending the last heartbeat response)
If a heartbeat response is received, the 50s timeout is reset [2]. If we do not receive a single heartbeat response for 50s, we will assume a failure [3]. Therefore, I do not think that there is a worst case or best case here.

Lastly I wanted to mention that since FLIP-6 [4], the responsibilities of the JobManager have been split. We now have a ResourceManager and one JobManager for every job (note that in the code the class is called JobMaster). Each instance employs heartbeating to each other and also to the TaskManagers.

Best,
Gary

[1] https://github.com/apache/flink/blob/bf1195232a49cce1897c1fa86c5af9ee005212c6/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatServices.java#L43
[2] https://github.com/apache/flink/blob/1b628d4a7d92f9c79c31f3fe90911940e0676b22/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatMonitorImpl.java#L117-L128
[3] https://github.com/apache/flink/blob/1b628d4a7d92f9c79c31f3fe90911940e0676b22/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatMonitorImpl.java#L106-L111
[4] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

On Tue, Mar 10, 2020 at 2:54 PM Morgan Geldenhuys <[hidden email]> wrote:

Hi community,

I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic.

Looking at the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), there are these two configuration options:

heartbeat.interval
10000 Long Time interval for requesting heartbeat from sender side.

heartbeat.timeout
50000 Long Timeout for requesting and receiving heartbeat for both sender and receiver sides.

This would indicate Flink uses a heartbeat mechanism to ascertain the liveness of TaskManagers. From this the following assumptions are made:

The JobManager is responsible for broadcasting a heartbeat requests to all TaskManagers and awaits responses.
If a response is not forthcoming from any particular node within the heartbeat timeout period, e.g. 50 seconds by default, then that node is timed out and assumed to have failed.
The heartbeat interval indicated how often the heartbeat request broadcast is scheduled.
Having the heartbeat interval shorter than the heartbeat timeout would mean that multiple requests can be underway at the same time.
Therefore, the TaskManager would need to fail to respond to 4 requests (assuming normal response times are lower than 10 seconds) before being timed out after 50 seconds.

So therefore if a failure were to occur (considering the default settings):
- In the best case the JobManager would detect the failure in the shortest time, i.e. 50 seconds +- (node fails just before receiving the next heartbeat request)
- In the worst case the JobManager would detect the failure in the longest time, i.e. 60 seconds +- (node fails just after sending the last heartbeat response)

Is this correct?

For JobManagers in HA mode, this is left to ZooKeeper timeouts which then initiates a round of elections and the new leader picks up from the previous checkpoint.

Thank you in advance.

Regards,
M.