Re: All but one TMs connect when JM has more than 16G of memory

Posted by Robert Schmidtke on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/All-but-one-TMs-connect-when-JM-has-more-than-16G-of-memory-tp2974p2975.html

I should say I'm running the current Flink master branch.

On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke <[hidden email]> wrote:
It's me again. This is a strange issue, and I hope I managed to find the right keywords. I have 8 machines: 1 for the JM, and 7 TMs with 64G of memory each.

When running my job like so:

$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....

The job completes without any problems. When running it like so:

$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....

(note the one extra MB of memory for the JM), the execution stalls, continuously reporting:

.....
TaskManager status (6/7)
TaskManager status (6/7)
TaskManager status (6/7)
.....

I did some poking around, but I couldn't find anything in the code that directly explains this behavior.

The JM log says:

.....
16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$                      -  JVM Options:
16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$                      -     -Xmx12289M
.....

but then continues to report

.....
16:52:59,311 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
16:52:59,831 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
16:53:00,351 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
.....

forever until I cancel the job.
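
For reference, the -Xmx12289M value in the JM log above is consistent with 25% of the 16385 MB request being held back from the JVM heap. The following is only a minimal sketch of that arithmetic, not Flink's or YARN's actual code: the cutoff ratio of 0.25, the 1024 MB allocation step, and both helper names are assumptions that do not appear in the logs.

```python
import math

# Assumed values, not taken from the post or the logs:
HEAP_CUTOFF_RATIO = 0.25        # fraction of container memory assumed to be kept off the JVM heap
YARN_ALLOCATION_STEP_MB = 1024  # assumed increment to which YARN rounds container requests


def jvm_heap_mb(container_mb, cutoff_ratio=HEAP_CUTOFF_RATIO):
    """Hypothetical helper: container memory minus a cutoff truncated to whole MB."""
    cutoff = int(container_mb * cutoff_ratio)
    return container_mb - cutoff


def rounded_container_mb(requested_mb, step_mb=YARN_ALLOCATION_STEP_MB):
    """Hypothetical helper: round a request up to the next multiple of step_mb."""
    return math.ceil(requested_mb / step_mb) * step_mb


# Compare the working and the stalling JM sizes from the two commands above.
for jm_mb in (16384, 16385):
    print(jm_mb, jvm_heap_mb(jm_mb), rounded_container_mb(jm_mb))
```

Under these assumptions, the 16385 MB request gives the logged 12289 MB heap and would occupy 17408 MB instead of 16384 MB. Whether that extra 1024 MB is what keeps the seventh container from being allocated is not something the logs above confirm.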

If you have any ideas, I'm happy to try them out. Thanks in advance for any hints! Cheers.

Robert
--
My GPG Key ID: 336E2680


