Why Job Manager die/restarted when Task Manager die/restarted?

Why Job Manager die/restarted when Task Manager die/restarted?

Cam Mach
Hello Flink experts,

We are running Flink on Kubernetes and see that the Job Manager dies or restarts whenever a Task Manager dies or restarts, or when the two cannot connect to each other. Are there any specific configuration parameters we need to set to stop this? Or is this expected?

Thanks,
Cam

Re: Why Job Manager die/restarted when Task Manager die/restarted?

Zhu Zhu
Hi Cam, 

The Flink master should not die when it gets disconnected from task managers.
It may exit in the cases below:
1. The job has terminated (FINISHED/FAILED/CANCELED). If your job is configured with no restart retries, a TM failure can cause the job to become FAILED.
2. The JM lost HA leadership, e.g. it lost its connection to ZK.
3. It encountered some other unexpected fatal error. In this case we need to check the log to see what happened.
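
As a rough sketch of case 1, a restart strategy can be set in flink-conf.yaml so that a single TM failure does not immediately move the job to FAILED. The attempt count and delay below are assumed values for illustration, not settings taken from this thread:

```yaml
# flink-conf.yaml (sketch; the numbers are illustrative assumptions)
# Retry the job a fixed number of times, waiting between attempts,
# instead of failing it on the first task failure.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

With a setting along these lines, the job should only reach FAILED after the configured attempts are used up, so the delay needs to be long enough for Kubernetes to bring a replacement TM back.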

Thanks,
Zhu Zhu

On Mon, Aug 12, 2019 at 12:15 PM Cam Mach <[hidden email]> wrote:
Hello Flink experts,

We are running Flink on Kubernetes and see that the Job Manager dies or restarts whenever a Task Manager dies or restarts, or when the two cannot connect to each other. Are there any specific configuration parameters we need to set to stop this? Or is this expected?

Thanks,
Cam

Re: Why Job Manager die/restarted when Task Manager die/restarted?

Zhu Zhu
Another possibility is that the JM is killed externally, e.g. K8s may kill the JM or a TM if it exceeds its resource limit.
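
To illustrate, the limit in question is set per container in the pod spec. The fragment below is only a sketch: the container name and the memory sizes are assumptions, not values from this thread.

```yaml
# Fragment of a JobManager pod spec (sketch; name and sizes are assumptions)
containers:
  - name: jobmanager            # hypothetical container name
    resources:
      requests:
        memory: "1536Mi"
      limits:
        memory: "1536Mi"        # the container is OOM-killed if it goes above this
```

If the limit was the cause, `kubectl describe pod` on the restarted pod typically reports the last state as terminated with reason `OOMKilled`. The JVM heap configured for the JM (for example `jobmanager.heap.size` in Flink releases from around the time of this thread) should leave some headroom below the container limit.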

Thanks,
Zhu Zhu

On Mon, Aug 12, 2019 at 1:45 PM Zhu Zhu <[hidden email]> wrote:
Hi Cam, 

The Flink master should not die when it gets disconnected from task managers.
It may exit in the cases below:
1. The job has terminated (FINISHED/FAILED/CANCELED). If your job is configured with no restart retries, a TM failure can cause the job to become FAILED.
2. The JM lost HA leadership, e.g. it lost its connection to ZK.
3. It encountered some other unexpected fatal error. In this case we need to check the log to see what happened.

Thanks,
Zhu Zhu

Re: Why Job Manager die/restarted when Task Manager die/restarted?

Cam Mach

Hi Zhu,

Looks like it's expected. Those are the cases that happened in our cluster.

Thanks for your response, Zhu.

Cam



On Sun, Aug 11, 2019 at 10:53 PM Zhu Zhu <[hidden email]> wrote:
Another possibility is that the JM is killed externally, e.g. K8s may kill the JM or a TM if it exceeds its resource limit.

Thanks,
Zhu Zhu
