Re: All but one TMs connect when JM has more than 16G of memory

Posted by rmetzger0 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/All-but-one-TMs-connect-when-JM-has-more-than-16G-of-memory-tp2974p2976.html

Hi Robert,

the problem here is that YARN's scheduler (YARN has several: FIFO, CapacityScheduler, ...) is not giving Flink's ApplicationMaster/JobManager all the containers it requests. Once the AM/JM container is made larger, there is probably not enough memory left to fit the last TaskManager container.
I ran into this issue as well: I wanted to run a Flink job on YARN, and in theory the containers should have fit, but YARN did not give me all the containers I requested.
Back then I asked on the yarn-dev list [1] (there were also some off-list emails), but we could not resolve the issue.
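
One way a single extra MB could tip the balance is request rounding: with the CapacityScheduler's default resource calculator, YARN rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb. The sketch below only illustrates that arithmetic. It assumes the default minimum allocation of 1024 MB and a purely hypothetical NodeManager capacity (yarn.nodemanager.resource.memory-mb) of 57344 MB on a node that has to host both the JM and one TM. Neither value is known for this cluster, and the function names are made up for illustration.

```python
# Rough sketch of YARN's container-size rounding. Assumptions (not taken from
# this thread): minimum allocation of 1024 MB and a node capacity of 57344 MB.

def normalize_request(requested_mb, min_alloc_mb=1024):
    """Round a request up to the next multiple of the scheduler minimum."""
    multiples = -(-requested_mb // min_alloc_mb)  # integer ceiling division
    return max(1, multiples) * min_alloc_mb

def fits_on_node(node_mb, requests_mb, min_alloc_mb=1024):
    """Check whether all (rounded) requests fit into one node's capacity."""
    used = sum(normalize_request(r, min_alloc_mb) for r in requests_mb)
    return used <= node_mb

NODE_MB = 57344  # hypothetical NodeManager capacity
TM_MB = 40960    # -ytm value from the commands in the original message

for jm_mb in (16384, 16385):  # the two -yjm values that were tried
    rounded = normalize_request(jm_mb)
    ok = fits_on_node(NODE_MB, [jm_mb, TM_MB])
    print(f"-yjm {jm_mb}: rounded to {rounded} MB, JM + TM fit on node: {ok}")
```

Under these assumptions, 16384 MB stays at 16384 MB and fits next to a 40960 MB TaskManager, while 16385 MB is rounded up to 17408 MB and no longer fits. Whether this is what happens here depends on the actual scheduler and NodeManager settings.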

Can you check the ResourceManager logs? There may be a log message that explains why the container request of Flink's AM is not being fulfilled.


[1] http://search-hadoop.com/m/AsBtCilK5r1pKLjf1&subj=Re+QUESTION+Allocating+a+full+YARN+cluster

On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke <[hidden email]> wrote:
It's me again. This is a strange issue, and I hope I managed to find the right keywords. I have 8 machines: 1 for the JM, and the other 7 are TMs with 64G of memory each.

When running my job like so:

$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....

The job completes without any problems. When running it like so:

$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....

(note the one extra MB of memory for the JM), the execution stalls, continuously reporting:

.....
TaskManager status (6/7)
TaskManager status (6/7)
TaskManager status (6/7)
.....

I did some poking around, but I couldn't find any direct correlation with the code.

The JM log says:

.....
16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$                      -  JVM Options:
16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$                      -     -Xmx12289M
.....

but then continues to report

.....
16:52:59,311 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
16:52:59,831 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
16:53:00,351 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
.....

forever until I cancel the job.

If you have any ideas I'm happy to try them out. Thanks in advance for any hints! Cheers.

Robert
--
My GPG Key ID: 336E2680