Jobmanager not properly fenced when killed by YARN RM


Jobmanager not properly fenced when killed by YARN RM

Paul Lam
Hi,

Recently I've seen a situation where a JobManager received a stop signal from the YARN RM but failed to exit and got stuck in a restart loop. It kept failing because the TaskManager containers had been disconnected (killed by the RM as well), and it only exited once it hit the limit of the restart policy. This further resulted in the Flink job being marked with final status FAILED and its ZooKeeper paths being cleaned up, so when a new JobManager started up it found no checkpoint to restore and performed a stateless restart. For context, the application runs on Flink 1.7.1 in HA job cluster mode on Hadoop 2.6.5.

As far as I can remember, I've seen a similar issue related to JobManager fencing, but I searched JIRA and couldn't find it. It would be great if someone could point me in the right direction. Any comments are also welcome! Thanks!

Best,
Paul Lam

Re: Jobmanager not properly fenced when killed by YARN RM

Yang Wang
Hi Paul,

I have gone through the code and found that the root cause may be that `YarnResourceManager` cleaned up the application staging directory. When its unregistration from the Yarn ResourceManager fails, a new attempt will be launched and will fail quickly because localization fails.

I think it is a bug, and it will happen whenever unregistering the application fails. Could you share some JobManager logs from the second or a following attempt so that I can confirm the bug?
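
To make the ordering concrete, here is a minimal, self-contained sketch of the failure mode described above (an illustration only, not the actual Flink source; the method names are made up):

    // Illustrative sketch: application deregistration where the staging
    // directory is removed regardless of whether unregistration succeeded.
    final class DeregistrationSketch {

        void deregisterApplication() {
            try {
                unregisterFromYarnResourceManager();
            } catch (Exception e) {
                // Corresponds to the "Could not unregister the application master"
                // log message mentioned later in this thread.
                System.err.println("Could not unregister the application master: " + e);
            }
            // Runs even when unregistration failed, so YARN still launches a new
            // attempt, but that attempt fails because its files are gone.
            deleteStagingDirectory();
        }

        void unregisterFromYarnResourceManager() throws Exception {
            // RPC to the YARN ResourceManager; may fail when the cluster is unhealthy.
        }

        void deleteStagingDirectory() {
            // Removes the .flink/<applicationId> staging files on HDFS.
        }
    }

The key point is that the staging directory removal is not conditional on the unregistration succeeding, which matches the localization failures of the follow-up attempts.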


Best,
Yang


Re: Jobmanager not properly fenced when killed by YARN RM

Paul Lam
Hi Yang,

Thanks a lot for your reply!

I might not have made myself clear: the new JobManager in the new YARN application attempt did start successfully. Unluckily, I didn't find any logs written by YarnResourceManager in the JobManager logs.

The JobManager logs are in the attachment (with some subtask status-change logs removed and environment info redacted).

Thanks!



Best,
Paul Lam

Attachment: jobmanager_not_fenced.csv (513K)

Re: Jobmanager not properly fenced when killed by YARN RM

Yang Wang
Hi Paul,

I found lots of "Failed to stop Container" logs in the jobmanager.log. It seems that the Yarn cluster is not working normally, so the Flink YarnResourceManager may also have failed to unregister the application. If the application is unregistered successfully, no new attempt will be started.

The second and following JobManager attempts started and then failed because the staging directory on HDFS had been cleaned up. Could you look for the log message "Could not unregister the application master" in all the JobManager logs, including the first one?



Best,
Yang



Re: Jobmanager not properly fenced when killed by YARN RM

Paul Lam
Hi Yang,

Thanks a lot for your reasoning. You are right about the YARN cluster. The NodeManager had crashed, and that's why the RM killed the containers on that machine after a heartbeat timeout (about 10 min) with the NodeManager.

Actually, the attached logs are from the first/old JobManager, and I couldn't find the log about YARN application unregistration in any of the logs. I think the Flink resource manager was not trying to unregister the application (which would also have removed the HA service state) when it got the shutdown request, because the Flink job was still running fine at that moment.

I dug a bit deeper and found that the root cause might be that the Flink ResourceManager was taking too long to shut down and didn't change the Flink job status (so the JobManager would keep working even after the AM was killed by the RM). I've also found the related issue mentioned in my previous mail. [1]

Thanks a lot for your help!


Best,
Paul Lam


Re: Jobmanager not properly fenced when killed by YARN RM

Yang Wang
Hi Paul,

Thanks for sharing your analysis. I think you are right. When the Yarn NodeManager crashed, the first JobManager running on it was not killed. However, once the Yarn ResourceManager detected that the NodeManager was lost, it launched a new JobManager attempt. Before FLINK-14010, only the Flink ResourceManager exits when it receives onShutdownRequest from the Yarn ResourceManager, so the Dispatcher and JobManager launched in the new AM cannot be granted leadership properly and will restart repeatedly.
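
As a rough illustration of that behaviour (a toy model with made-up names, not Flink's actual classes and not the FLINK-14010 patch itself):

    // Toy model only: the component names below are invented for illustration.
    final class ShutdownSketch {

        interface Component {
            void stop();
        }

        private final Component resourceManager = () -> System.out.println("ResourceManager stopped");
        private final Component dispatcher      = () -> System.out.println("Dispatcher stopped");
        private final Component jobManager      = () -> System.out.println("JobManager stopped");

        // Behaviour described above, before FLINK-14010: only the resource
        // manager component reacts to the shutdown request from the YARN RM,
        // so the dispatcher/jobmanager keep running and hold on to leadership.
        void onShutdownRequestBefore() {
            resourceManager.stop();
        }

        // What the fix aims for, roughly: the shutdown request brings down the
        // whole process so that leadership can pass to the next attempt.
        void onShutdownRequestAfter() {
            resourceManager.stop();
            dispatcher.stop();
            jobManager.stop();
            // In a real process this would end by terminating the JVM.
        }
    }

Either way, the essential property is that a shutdown request from YARN must eventually release leadership; otherwise the components started in the new attempt cannot become leader.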



Best,
Yang
