Should the entire cluster be restarted if a single Task Manager crashes?

HarshithBolar

Hi all,

We're running a standalone Flink cluster with 2 Job Managers and 3 Task Managers. Whenever a TM crashes, we simply restart that particular TM and carry on processing.

But the comments on this question make it look like we need to restart all 5 nodes in the cluster to deal with the failure of a single TM. Am I reading this right? What would be the consequences if we restart just the crashed TM and leave the healthy ones running as is?

Thanks,
Harshith

Re: Should the entire cluster be restarted if a single Task Manager crashes?

Fabian Hueske-2
Hi Harshith,

No, you don't need to restart the whole cluster. Flink only needs enough processing slots to recover the job.
If you have a standby TM, the job should restart immediately (according to its restart strategy). Otherwise, you have to start a new TM to provide the missing slots. Once those slots are registered, the job recovers.
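
To make the slot requirement concrete, here is a rough back-of-the-envelope check. It is only a sketch, assuming default slot sharing, so that a job needs roughly as many slots as its highest operator parallelism. The function name `can_recover` and all numbers are hypothetical and not part of Flink.

```python
def can_recover(slots_per_tm: int, live_task_managers: int, job_parallelism: int) -> bool:
    """Return True if the surviving TMs still offer enough slots for the job.

    Simplified model (an assumption, not Flink's scheduler): the job needs
    one slot per unit of parallelism, and every TM offers the same number
    of slots.
    """
    available_slots = slots_per_tm * live_task_managers
    return available_slots >= job_parallelism


# Hypothetical cluster: 3 TMs with 4 slots each, running a job with parallelism 8.
SLOTS_PER_TM = 4
JOB_PARALLELISM = 8

# One TM crashed: 2 TMs * 4 slots = 8 slots, enough to restart the job.
after_one_crash = can_recover(SLOTS_PER_TM, live_task_managers=2, job_parallelism=JOB_PARALLELISM)

# Two TMs crashed: 1 TM * 4 slots = 4 slots, the job waits for a new TM.
after_two_crashes = can_recover(SLOTS_PER_TM, live_task_managers=1, job_parallelism=JOB_PARALLELISM)

print(after_one_crash, after_two_crashes)
```

In the first case the spare capacity on the healthy TMs plays the role of the standby TM; in the second, the job can only recover once a replacement TM has registered its slots.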

Best,
Fabian

On Fri, 18 Jan 2019 at 10:53, Kumar Bolar, Harshith <[hidden email]> wrote:


Re: Re: Should the entire cluster be restarted if a single Task Manager crashes?

HarshithBolar

Thanks a lot for clarifying :-)

- Harshith

From: Fabian Hueske <[hidden email]>
Date: Friday, 18 January 2019 at 4:31 PM
To: Harshith Kumar Bolar <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: [External] Re: Should the entire cluster be restarted if a single Task Manager crashes?

 
