Containers are not released after job failed

liujiangang
      I run Flink 1.6.2 on YARN. At some point the job failed because of: org.apache.flink.util.FlinkException: The assigned slot container_e708_1555051789618_2644286_01_000061_0 was removed

      The job then restarted. Some time later, the container container_e708_1555051789618_2644286_01_000061 had still not been released.

      The log of container_e708_1555051789618_2644286_01_000061 is as follows:
image.png

      The log shows that two tasks were canceled before successful registration at the resource manager and one was canceled after registration. Five minutes later, the container registered again. In the end, the container is alive but not used.
      Does anyone have any idea about this problem? Thank you.

Re: Containers are not released after job failed

Timo Walther
Hi,

I will loop in Till here, who might know about this problem. In the meantime, could you tell us a bit more about your setup/deployment (how is YARN configured, and how is the Flink job submitted?) and link to the full logs?

Thanks,
Timo


(In reply to 刘建刚's message of 26 April 2019, 11:15.)



Re: Containers are not released after job failed

Till Rohrmann
Hi,

Have you checked whether the same problem also occurs with one of the latest Flink releases (1.8.0, 1.7.2, or 1.6.4)?

If yes, then I would need to take a look at the logs to better understand what's happening.

Cheers,
Till

(In reply to Timo Walther's message of Fri, Apr 26, 2019 at 12:33 PM.)


Re: Containers are not released after job failed

liujiangang
Thank you, it is fixed in the new version.
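
For anyone hitting the same symptom who wants to collect the full logs Timo asked for, here is a minimal Python sketch that derives the YARN application and attempt IDs from the container name in the exception and builds the matching `yarn` CLI calls. The helper names (`parse_container_id`, `yarn_commands`) are made up for illustration. It assumes the trailing `_0` in the exception message is Flink's slot index rather than part of the YARN container ID, that the IDs follow the usual `container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>` layout, and that log aggregation is enabled on the cluster; older Hadoop releases may additionally require `-nodeAddress` for `yarn logs`.

```python
import re
from typing import List, NamedTuple


class YarnIds(NamedTuple):
    application_id: str
    attempt_id: str
    container_id: str


# Assumed layout: container_[e<epoch>_]<clusterTs>_<appSeq>_<attempt>_<containerSeq>
# with an optional trailing _<slotIndex> as printed in the Flink exception.
_CONTAINER_RE = re.compile(
    r"^(container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_\d+)(?:_\d+)?$"
)


def parse_container_id(raw: str) -> YarnIds:
    """Hypothetical helper: split a container/slot name into YARN IDs."""
    match = _CONTAINER_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a recognizable YARN container ID: {raw!r}")
    container_id, cluster_ts, app_seq, attempt = match.groups()
    application_id = f"application_{cluster_ts}_{app_seq}"
    # Attempt IDs are printed with a zero-padded, six-digit attempt number.
    attempt_id = f"appattempt_{cluster_ts}_{app_seq}_{int(attempt):06d}"
    return YarnIds(application_id, attempt_id, container_id)


def yarn_commands(ids: YarnIds) -> List[str]:
    """Build CLI calls to fetch the container log and list live containers."""
    return [
        f"yarn logs -applicationId {ids.application_id} "
        f"-containerId {ids.container_id}",
        f"yarn container -list {ids.attempt_id}",
    ]


if __name__ == "__main__":
    slot = "container_e708_1555051789618_2644286_01_000061_0"
    for command in yarn_commands(parse_container_id(slot)):
        print(command)
```

If the container still shows up in the `yarn container -list` output long after the job has restarted, that matches the "alive but not used" state described in the first message.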



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/