Re: TaskManager HA on YARN

Posted by Till Rohrmann on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/TaskManager-HA-on-YARN-tp17001p17004.html

Hi Hayden,

in Yarn mode, Flink will tolerate as many TM failures as you have configured `yarn.maximum-failed-containers`. Per default this is set to the initial number of requested TMs. So in your case, the Flink cluster would restart twice a TM and then fail the cluster once a TM fails for the third time.

Cheers,
Till

On Mon, Dec 4, 2017 at 11:31 AM, Marchant, Hayden <[hidden email]> wrote:
Hi,

WE are currently start to test Flink running on YARN. Till now, we've been testing on Standalone Cluster. One thing lacking in standalone is that we have to manually restart a Task Manager if it dies. I looked at https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/jobmanager_high_availability.html#yarn-cluster-high-availability , and see that YARN deals with HA for Job Manager. How does it deal with a Task Manager if it dies? I would like the Task Manager to be dealt with similarly to Job Manager on failure. For example, let's say I have a cluster with two Task Managers, and one task manager dies. Will YARN restart the dead Task Manager, or would that need to be a manual restart?

What actually would happen in the above case?

Thanks,
Hayden