http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/All-but-one-TMs-connect-when-JM-has-more-than-16G-of-memory-tp2974.html
It's me again. This is a strange issue; I hope I managed to find the right keywords. I have 8 machines: 1 for the JM, and the other 7 are TMs with 64G of memory each.
When running my job like so:
$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....
The job completes without any problems. When running it like so:
$FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....
(note the single extra MB of memory for the JM), the execution stalls, continuously reporting:
.....
TaskManager status (6/7)
TaskManager status (6/7)
TaskManager status (6/7)
.....
I did some poking around, but I couldn't find any direct correlation with the code.
The JM log says:
.....
16:49:01,893 INFO org.apache.flink.yarn.ApplicationMaster$ - JVM Options:
16:49:01,893 INFO org.apache.flink.yarn.ApplicationMaster$ - -Xmx12289M
.....
but then continues to report
.....
16:52:59,311 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
16:52:59,831 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
16:53:00,351 INFO org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
.....
forever until I cancel the job.
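For what it's worth, the -Xmx12289M above looks like the requested 16385 MB minus a heap cutoff of roughly 25%. Here is a rough sketch of what I assume that calculation is; the 0.25 ratio and the 600 MB minimum are my guesses, not values I pulled from the Flink source:

// Sketch only: the cutoff ratio (0.25) and minimum (600 MB) are assumptions.
def jvmHeapMb(containerMemoryMb: Int): Int = {
  val cutoffMb = math.max(600, (containerMemoryMb * 0.25).toInt)
  containerMemoryMb - cutoffMb
}

jvmHeapMb(16384)  // 12288
jvmHeapMb(16385)  // 12289, matching the -Xmx12289M in the JM log

So the heap size itself seems consistent with what I asked for; I just can't see why the seventh container never comes up.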
If you have any ideas, I'm happy to try them out. Thanks in advance for any hints! Cheers.
Robert