How does JobManager terminate dangling task manager

How does JobManager terminate dangling task manager

narasimha
Hi, 

I'm trying to understand how the JobManager kills a TaskManager that has not responded to heartbeats for a certain time.

For example: 

If the network connection between the JobManager and a TaskManager is lost for some reason, the JobManager will bring up another TaskManager after the heartbeat timeout.
In that case, how does the JobManager make sure that all connections from the lost TaskManager (to Kafka, for example) are cut, and that the new TaskManager resumes from a consistent point?

I would also like to learn how to debug what caused the timeout. Our job handles about 5k records/s, so it is not a heavy-traffic job.
--
A.Narasimha Swamy

Re: How does JobManager terminate dangling task manager

Guowei Ma
Hi,
In fact, not only will the JobManager (ResourceManager) kill a TaskManager that has timed out; if a TaskManager finds that it cannot connect to the JobManager (ResourceManager), it will also exit by itself.
You can check the logs for what happened around the time the heartbeat timeout occurred. I normally also look at the GC activity at that time.
Best,
Guowei


(In reply to narasimha's message of Thu, May 13, 2021 at 11:06 AM.)

Re: How does JobManager terminate dangling task manager

Xintong Song
Hi narasimha,

For each TaskManager, there are two kinds of connections to the JobManager process.
- A single connection to the ResourceManager (RM), which allows the RM to monitor the slots' availability and assign them to Flink jobs.
- Connections to each JobMaster that the slots of this TM are assigned to.

Upon JobMaster-TM disconnection, all tasks of the corresponding job that are running on the TM fail immediately. Taking the Kafka source as an example, that is the point where the task stops consuming data from Kafka.
Upon RM-TM disconnection, the TM kills itself if it cannot reconnect to the RM within a certain time.
Since the JobMaster and the RM run in the same process, when one of the two connections breaks, the other usually breaks as well. If it does not, the RM-TM disconnection alone does not fail the running tasks until the reconnection timeout expires.
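
To make the timeout idea concrete, here is a minimal sketch of a heartbeat monitor that one side could run for the other. It is only an illustration and not Flink's actual implementation: the class HeartbeatMonitor, its methods, the TM ids, and the 50-second timeout are all hypothetical names and values chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical heartbeat monitor; a sketch, not Flink's implementation. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    // Time (ms) at which the last heartbeat arrived from each monitored target.
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat from the target at the given time. */
    void reportHeartbeat(String targetId, long nowMillis) {
        lastSeen.put(targetId, nowMillis);
    }

    /** Removes and returns every target that has been silent longer than the timeout. */
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                expired.add(e.getKey());
            }
        }
        lastSeen.keySet().removeAll(expired);
        return expired;
    }

    public static void main(String[] args) {
        // Assumed timeout of 50 s, picked only for this example.
        HeartbeatMonitor jobMasterSide = new HeartbeatMonitor(50_000);
        jobMasterSide.reportHeartbeat("tm-1", 0);
        jobMasterSide.reportHeartbeat("tm-2", 0);
        jobMasterSide.reportHeartbeat("tm-2", 30_000); // only tm-2 keeps reporting
        // At t = 60 s, tm-1 has been silent for longer than the timeout.
        System.out.println(jobMasterSide.expire(60_000));
    }
}
```

In this picture, the JobMaster side would fail the tasks of every expired TM, while a TM running the same kind of monitor against the RM would shut itself down once the RM expires.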

As for failover consistency, that is guaranteed by the checkpointing mechanism. The new task does not resume from the exact position where the old task stopped. Instead, it resumes from the last successful checkpoint.
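
The resume-from-checkpoint behaviour can be sketched with a toy reader over an in-memory log. This is a simplified illustration under my own assumptions, not Flink's Kafka connector: CheckpointedReader, poll, and checkpoint are hypothetical names, and a real checkpoint also covers operator state, not just a source offset.

```java
import java.util.List;

/** Hypothetical source task reading an in-memory log; a sketch, not Flink's Kafka source. */
final class CheckpointedReader {
    private final List<String> log;
    private int offset; // index of the next record to read

    /** Starts reading at the offset restored from the last successful checkpoint. */
    CheckpointedReader(List<String> log, int restoredOffset) {
        this.log = log;
        this.offset = restoredOffset;
    }

    /** Returns the next record, or null when the log is exhausted. */
    String poll() {
        return offset < log.size() ? log.get(offset++) : null;
    }

    /** Returns the offset that a checkpoint taken now would store. */
    int checkpoint() {
        return offset;
    }

    public static void main(String[] args) {
        List<String> log = List.of("r0", "r1", "r2", "r3", "r4");

        CheckpointedReader oldTask = new CheckpointedReader(log, 0);
        oldTask.poll();
        oldTask.poll();
        oldTask.poll();
        int lastCheckpoint = oldTask.checkpoint(); // checkpoint completes after r0..r2
        oldTask.poll();                            // r3 is read, then the TM is lost

        // The replacement task starts from the checkpoint, not from where the old task stopped.
        CheckpointedReader newTask = new CheckpointedReader(log, lastCheckpoint);
        System.out.println(newTask.poll());
    }
}
```

Here the old task had already read r3 when it was lost, but the new task reads r3 again because the last checkpoint only covered r0 to r2.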

Thank you~

Xintong Song



(In reply to Guowei Ma's message of Thu, May 13, 2021 at 5:38 PM.)