http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/All-but-one-TMs-connect-when-JM-has-more-than-16G-of-memory-tp2974.html
It's me again. This is a strange issue; I hope I managed to find the right keywords. I have 8 machines: 1 for the JM, and the other 7 are TMs with 64G of memory each.
When running my job like so:
$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....
The job completes without any problems. When running it like so:
$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....
(note the single extra MB of memory for the JM), the execution stalls, continuously reporting:
.....
TaskManager status (6/7)
TaskManager status (6/7)
TaskManager status (6/7)
.....
I did some poking around, but I couldn't find any direct correlation with the code.
The JM log says:
.....
16:49:01,893 INFO org.apache.flink.yarn.ApplicationMaster$ - JVM Options:
16:49:01,893 INFO org.apache.flink.yarn.ApplicationMaster$ - -Xmx12289M
.....
but then continues to report
.....
16:52:59,311 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
16:52:59,831 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
16:53:00,351 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
.....
forever until I cancel the job.
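For what it's worth, the -Xmx12289M above looks like the requested 16385 MB minus a heap cutoff of roughly 25%. Here is a rough sketch of what I assume that calculation is; the 0.25 ratio and the 600 MB minimum are my guesses, not values I pulled from the Flink source:

// Sketch only: the cutoff ratio (0.25) and minimum (600 MB) are assumptions.
def jvmHeapMb(containerMemoryMb: Int): Int = {
  val cutoffMb = math.max(600, (containerMemoryMb * 0.25).toInt)
  containerMemoryMb - cutoffMb
}

jvmHeapMb(16384)  // 12288
jvmHeapMb(16385)  // 12289, matching the -Xmx12289M in the JM log

So the heap size itself seems consistent with what I asked for; I just can't see why the seventh container never comes up.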
If you have any ideas, I'm happy to try them out. Thanks in advance for any hints! Cheers.
Robert