Hi community,
I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic. Looking at the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), there are these two configuration options:
The JobManager is responsible for broadcasting a heartbeat requests to all TaskManagers and awaits responses. If a response is not forthcoming from any particular node within the heartbeat timeout period, e.g. 50 seconds by default, then that node is timed out and assumed to have failed. The heartbeat interval indicated how often the heartbeat request broadcast is scheduled. Having the heartbeat interval shorter than the heartbeat timeout would mean that multiple requests can be underway at the same time. Therefore, the TaskManager would need to fail to respond to 4 requests (assuming normal response times are lower than 10 seconds) before being timed out after 50 seconds. So therefore if a failure were to occur (considering the default settings): - In the best case the JobManager would detect the failure in the shortest time, i.e. 50 seconds +- (node fails just before receiving the next heartbeat request) - In the worst case the JobManager would detect the failure in the longest time, i.e. 60 seconds +- (node fails just after sending the last heartbeat response) Is this correct? For JobManagers in HA mode, this is left to ZooKeeper timeouts which then initiates a round of elections and the new leader picks up from the previous checkpoint. Thank you in advance. Regards, M. |
Hi Morgan, > I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic. It is probably best to look into the implementation (see my answers below). > Having the heartbeat interval shorter than the heartbeat timeout would mean that multiple requests can be underway at the same time. Yes, in fact the heartbeat interval must be shorter than the timeout or else an exception is thrown [1] > - In the worst case the JobManager would detect the failure in the longest time, i.e. 60 seconds +- (node fails just after sending the last heartbeat response) If a heartbeat response is received, the 50s timeout is reset [2]. If we do not receive a single heartbeat response for 50s, we will assume a failure [3]. Therefore, I do not think that there is a worst case or best case here. Lastly I wanted to mention that since FLIP-6 [4], the responsibilities of the JobManager have been split. We now have a ResourceManager and one JobManager for every job (note that in the code the class is called JobMaster). Each instance employs heartbeating to each other and also to the TaskManagers. Best, Gary [1] https://github.com/apache/flink/blob/bf1195232a49cce1897c1fa86c5af9ee005212c6/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatServices.java#L43 [2] https://github.com/apache/flink/blob/1b628d4a7d92f9c79c31f3fe90911940e0676b22/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatMonitorImpl.java#L117-L128 [3] https://github.com/apache/flink/blob/1b628d4a7d92f9c79c31f3fe90911940e0676b22/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatMonitorImpl.java#L106-L111 [4] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077 On Tue, Mar 10, 2020 at 2:54 PM Morgan Geldenhuys <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |