Hi all, I'm Flink newbie, and trying to understand Flink cluster’s recovery feature using Flink 1.7.2 and YARN 2.8. To confirm HA cluster’s behavior, I created Flink YARN session cluster and stopped JobManager repeatedly using kill command after job deployment. In that test, I set “yarn.application-attempts” to 5, but Flink cluster was recovered more than 5 times. Does anyone know what “yarn.application-attempts” mean, and when Flink cluster’s attempts time will be incremented ? I asked same question at stackoverflow, but I still don’t get it.
Best, --Kazunori Shinhira Mail : [hidden email] |
AFAIR, your manual kill won't count towards the max-attempt counter in hadoop's logic. Please see this post for more details: http://johnjianfang.blogspot.com/2015/04/the-number-of-maximum-attempts-of-yarn.html. On Sun, Jun 2, 2019 at 9:48 AM 新平和礼 <[hidden email]> wrote:
|
Hi, Shuyi Thank you for your reply. I read the blog post you suggested. I understand that YARN don’t count container failure when container exited with status one of PREEMPTED, ABORTED, DISK_FAILED, and KILLED_BY_RESOURCEMANAGER, and stopping JobManager with kill command will cause one of that status. Now, I have another question about Flink recovery process. I tried to create NON HA Flink cluster and killed JobManager, then that cluster was failed after I executed second time kill command. The text like bellow was recorded as YARN Application’s Diagnostics message. ``` Application application_1545044154652_0154 failed 2 times due to AM Container for appattempt_1545044154652_0154_000002 exited with exitCode: 137 Failing this attempt.Diagnostics: Container killed on request. Exit code is 137 ..(omitted) ``` Furthermore following messages was recorded in ResourceManager’s log, so I think that is expected behavior described in Flink document at https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html#maximum-application-master-attempts-yarn-sitexml . ``` The specific max attempts: 10 for application: 154 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead. ``` ※ This message was also logged when using HA cluster, I found this after sending my first e-mail. My question is that why attempts time is counted only when using NON HA cluster ? I use same command to kill JobManager but only when using NON HA cluster, it looks like JobManager failure is counted towards to calculate max attempts. Is there any other conditions to count attempts time or is there possibility to happen another failure in recovery process when using NON HA cluster? I check JobManager's log but can't find difference between HA cluster and NON HA cluster. May be this is a question about Hadoop YARN implementation rather than Apache Flink. I’m sorry if this is inappropriate for this mailing list. Thanks, 2019年6月3日(月) 4:00 Shuyi Chen <[hidden email]>:
Kazunori Shinhira Mail : [hidden email] |
Free forum by Nabble | Edit this page |