Flink HA cluster on YARN is restarted more than yarn.application-attempts value


新平和礼

Hi all,


I'm new to Flink and am trying to understand Flink's cluster recovery feature, using Flink 1.7.2 and YARN 2.8.

To confirm how an HA cluster behaves, I created a Flink YARN session cluster and, after deploying a job, repeatedly stopped the JobManager with the kill command.

In that test I set yarn.application-attempts to 5, but the Flink cluster was recovered more than 5 times.


Does anyone know what yarn.application-attempts means, and when a Flink cluster's attempt count is incremented?


I asked the same question on Stack Overflow, but I still don't get it.


https://stackoverflow.com/questions/56225088/why-is-flink-ha-cluster-on-yarn-recovered-more-than-the-maximum-number-of-attemp



Best,

--
Kazunori Shinhira

Re: Flink HA cluster on YARN is restarted more than yarn.application-attempts value

Shuyi Chen
AFAIR, your manual kill won't count towards the max-attempt counter in Hadoop's logic. Please see this post for more details: http://johnjianfang.blogspot.com/2015/04/the-number-of-maximum-attempts-of-yarn.html

(In reply to the message from 新平和礼, Sun, Jun 2, 2019 at 9:48 AM)

Re: Flink HA cluster on YARN is restarted more than yarn.application-attempts value

新平和礼

Hi Shuyi,


Thank you for your reply.

I read the blog post you suggested.

As I understand it, YARN does not count a container failure toward the attempt limit when the container exits with one of the statuses PREEMPTED, ABORTED, DISKS_FAILED, or KILLED_BY_RESOURCEMANAGER, and stopping the JobManager with the kill command results in one of those statuses.
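
The rule described in the blog post can be sketched roughly as follows. This is a simplified model for illustration, not YARN's actual code: the function names and the use of plain status-name strings are assumptions, and only the four status names come from the post.

```python
# Simplified model of the counting rule; not YARN's actual implementation.
# Statuses are plain strings here purely for illustration.
NOT_COUNTED_STATUSES = frozenset({
    "PREEMPTED",
    "ABORTED",
    "DISKS_FAILED",
    "KILLED_BY_RESOURCEMANAGER",
})


def counts_toward_max_attempts(exit_status: str) -> bool:
    """Return True if a failed AM attempt with this status uses up one attempt."""
    return exit_status not in NOT_COUNTED_STATUSES


def counted_failures(exit_statuses) -> int:
    """Count only the failures that are charged against the attempt limit."""
    return sum(1 for s in exit_statuses if counts_toward_max_attempts(s))
```

Under this model, an application only fails for good once `counted_failures(...)` reaches the effective maximum number of attempts, so any number of failures with the excluded statuses could occur without exhausting the limit.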


Now I have another question, about the Flink recovery process.

I created a non-HA Flink cluster and killed the JobManager; that cluster failed after I ran the kill command a second time.


The following text was recorded as the YARN application's Diagnostics message.


```

Application application_1545044154652_0154 failed 2 times due to AM Container for appattempt_1545044154652_0154_000002 exited with exitCode: 137

Failing this attempt.Diagnostics: Container killed on request. Exit code is 137

..(omitted)

```
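
As a side note on the exit code in these diagnostics: by the usual shell convention, an exit code above 128 means the process was terminated by a signal, and the signal number is the exit code minus 128. The sketch below decodes 137 under that convention. The helper name `signal_from_exit_code` is hypothetical, and the code assumes a POSIX system on which signal 9 is SIGKILL.

```python
import signal


def signal_from_exit_code(exit_code: int):
    """Return the signal name for a shell-style exit code (128 + n), else None."""
    if exit_code <= 128:
        return None  # ordinary exit, not a signal death under this convention
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


print(signal_from_exit_code(137))
```

On Linux this yields SIGKILL (137 = 128 + 9), which is consistent with the "Container killed on request" text in the diagnostics.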


Furthermore, the following message was recorded in the ResourceManager's log, so I think this is the expected behavior described in the Flink documentation at https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html#maximum-application-master-attempts-yarn-sitexml .


```

The specific max attempts: 10 for application: 154 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.

```

This message was also logged when using the HA cluster; I noticed this after sending my first e-mail.
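
The fallback that this log line describes can be sketched as below. It is a minimal model inferred only from the wording of the message, not taken from the ResourceManager source; the function and parameter names are hypothetical.

```python
def effective_max_attempts(requested: int, global_max: int) -> int:
    """Model of the fallback described by the log message.

    A per-application value outside the range [1, global_max] is treated as
    invalid, and the global maximum is used instead.
    """
    if 1 <= requested <= global_max:
        return requested
    return global_max


# Values from the log line: requested 10, allowed range [1, 2].
print(effective_max_attempts(10, 2))  # prints 2
```

With these values the application is limited to 2 attempts, which matches the "failed 2 times" in the diagnostics for the non-HA cluster.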



My question is: why are attempts counted only when using the non-HA cluster?

I use the same command to kill the JobManager, but only with the non-HA cluster does the JobManager failure appear to count toward the maximum number of attempts.

Are there other conditions under which attempts are counted, or could some other failure occur during the recovery process when using a non-HA cluster?

I checked the JobManager's log but couldn't find any difference between the HA cluster and the non-HA cluster.



Maybe this is a question about the Hadoop YARN implementation rather than Apache Flink.

I'm sorry if this is inappropriate for this mailing list.



Thanks,


(In reply to the message from Shuyi Chen, Mon, Jun 3, 2019 at 4:00)


--
Kazunori Shinhira