Hello,
Today we encountered an issue where our Flink job requested YARN containers indefinitely. As the JM log below shows, there were errors when starting the TMs (caused by underlying HDFS errors), so the allocated containers failed and the job kept requesting new ones. The failed containers were also not returned to YARN, so the job quickly exhausted our YARN resources. Is there any way we can avoid this behavior? Thank you!

————————
JM log:

INFO org.apache.flink.yarn.YarnResourceManager - Creating container launch context for TaskManagers
INFO org.apache.flink.yarn.YarnResourceManager - Starting TaskManagers
INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - Opening proxy : xxx.yyy
ERROR org.apache.flink.yarn.YarnResourceManager - Could not start TaskManager in container container_e12345.
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
....
INFO org.apache.flink.yarn.YarnResourceManager - Requesting new TaskExecutor container with resources <memory:16384, vCores:4>. Number pending requests 19.
INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_e195_1553781735010_27100_01_000136 - Remaining pending container requests: 19
————————

Thanks,
Qi
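For reference, below is a minimal flink-conf.yaml sketch for capping how many failed containers are re-allocated and bounding job restarts. The option names are existing Flink settings, but whether yarn.maximum-failed-containers is honored by the FLIP-6 YarnResourceManager seen in the log above is an assumption that should be verified against the Flink version in use.

————————
# flink-conf.yaml (sketch; verify behavior for your Flink version and deployment mode)

# Cap how many failed containers are re-allocated before the deployment is failed,
# instead of requesting replacement containers indefinitely.
yarn.maximum-failed-containers: 10

# Bound job restarts so that repeated TaskManager start failures eventually fail
# the job rather than holding on to cluster resources forever.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 30 s
————————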
Hi Qi,

I think the problem may be related to a similar issue reported in a previous JIRA [1]. I believe a PR is also under discussion.

Thanks,
Rong

On Fri, Mar 29, 2019 at 5:09 AM qi luo <[hidden email]> wrote:
Hi Qi,

The current version of the PR is runnable in production, but according to Till's suggestion it needs one more round of changes.

Best Regards,
Peter Huang

On Fri, Mar 29, 2019 at 3:42 PM Rong Rong <[hidden email]> wrote:
Thanks Rong, I will follow that issue.