(DEPRECATED) Apache Flink User Mailing List archive.

Re: Containers are not released after job failed

Posted by Till Rohrmann on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Containers-are-not-released-after-job-failed-tp27549p27558.html

Hi,

Have you checked whether the same problem also occurs with one of the latest Flink releases (1.8.0, 1.7.2 or 1.6.4)?

If yes, then I would need to take a look at the logs to better understand what's happening.

Cheers,

Till

On Fri, Apr 26, 2019 at 12:33 PM Timo Walther <[hidden email]> wrote:

Hi,

I will loop in Till here, who might know about this problem. In the meantime, could you tell us a bit more about your setup/deployment (how is YARN configured, and how is the Flink job submitted?) and link to the full logs?

Thanks,

Timo

On 26.04.19 at 11:15, 刘建刚 wrote:

I run Flink 1.6.2 on YARN. At some point, the job failed because of: org.apache.flink.util.FlinkException: The assigned slot container_e708_1555051789618_2644286_01_000061_0 was removed

Then the job restarts. After some time, the container container_e708_1555051789618_2644286_01_000061 is still not released.
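
A note on the two identifiers above: the slot named in the exception looks like the container ID with a slot index appended ("_0"), so both lines seem to refer to the same container. The minimal Python sketch below splits such a slot ID into those two parts. The pattern and the helper name split_slot_id are assumptions made for illustration, based only on the shape of the IDs in this message; they are not taken from Flink or YARN code.

```python
import re

# Assumed layout, inferred from the IDs in this thread:
#   "<YARN container id>_<slot index>"
# where the container id is "container_", an optional "e<epoch>_" part,
# and four numeric fields separated by underscores.
SLOT_ID = re.compile(r"^(container_(?:e\d+_)?\d+_\d+_\d+_\d+)_(\d+)$")


def split_slot_id(slot_id):
    """Return (container_id, slot_index) for a slot id shaped as above."""
    match = SLOT_ID.match(slot_id)
    if match is None:
        raise ValueError(f"unexpected slot id: {slot_id!r}")
    return match.group(1), int(match.group(2))


container_id, slot_index = split_slot_id(
    "container_e708_1555051789618_2644286_01_000061_0"
)
print(container_id, slot_index)
```

The resulting container ID can then be used to search the JobManager and YARN logs for the container that was not released.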

The log of container_e708_1555051789618_2644286_01_000061 is as follows:

The log shows that two tasks are canceled before successful registration at the resource manager and one is canceled after registration. After five minutes, the container registers again. In the end, the container is alive but not used.

Does anyone have any idea about this problem? Thank you.