Hi Harshith,
No, you don't need to restart the whole cluster. Flink only needs enough processing slots to recover the job.
If you have a standby TM, the job should restart immediately (according to its restart strategy). Otherwise, you have to start a new TM, or restart the crashed one, to provide the missing slots. Once the slots are registered, the job recovers.
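To make the slot reasoning concrete, here is a minimal sketch. It is a toy model, not Flink code: the function name `can_recover` and the numbers (3 TMs with 2 slots each, a job needing 6 slots) are assumptions chosen only for illustration.

```python
def can_recover(required_slots, slots_per_tm, live_tms):
    """Toy model: the job can be rescheduled once the slots registered
    by the live TMs cover the slots the job requires."""
    return live_tms * slots_per_tm >= required_slots


# Hypothetical setup: 3 TMs with 2 slots each, and a job that uses all 6 slots.
REQUIRED_SLOTS = 6
SLOTS_PER_TM = 2

print(can_recover(REQUIRED_SLOTS, SLOTS_PER_TM, live_tms=3))  # healthy cluster
print(can_recover(REQUIRED_SLOTS, SLOTS_PER_TM, live_tms=2))  # one TM crashed
print(can_recover(REQUIRED_SLOTS, SLOTS_PER_TM, live_tms=3))  # crashed TM restarted
```

In this model the healthy TMs and the Job Managers are never touched; only the missing slots have to come back. If the job needed fewer slots than the surviving TMs offer (the standby case), it could recover without starting anything.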
Best,
Fabian
On Fri, Jan 18, 2019 at 10:53 AM, Kumar Bolar, Harshith <[hidden email]> wrote:
Hi all,
We're running a standalone Flink cluster with 2 Job Managers and 3 Task Managers. Whenever a TM crashes, we simply restart that particular TM and proceed with the processing.
But reading the comments on this question makes it look like we need to restart all 5 nodes that form the cluster to deal with the failure of a single TM. Am I reading this right? What would be the consequences if we restart just the crashed TM and let the healthy ones keep running as they are?
Thanks,
Harshith