TM heartbeat timeout due to ResourceManager being busy

Posted by Paul Lam on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/TM-heartbeat-timeout-due-to-ResourceManager-being-busy-tp38626.html

Hi,

Even with FLINK-13184 implemented (and even on Flink 1.11), jobs with high parallelism 
occasionally still hit TM-RM heartbeat timeouts when the RM is busy creating TM contexts 
during cluster initialization and HDFS happens to be slow at that moment.

Apart from increasing the TM heartbeat timeout, is there any recommended out-of-the-box 
approach that can reduce the chance of hitting these timeouts?

In the long run, would it be possible to limit the number of TaskManager contexts that the RM 
creates at a time, so that heartbeat handling can run in between?
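
To make the throttling idea concrete, here is a toy model in Python. It is not Flink code and does not use any Flink API. It assumes a single-threaded loop that both creates contexts and serves heartbeats. The function name `simulate`, its parameters, and all numbers (100 ms per context, 10 s heartbeat interval, 50 s timeout) are hypothetical values picked for illustration.

```python
def simulate(num_contexts, create_cost_ms, heartbeat_interval_ms,
             heartbeat_timeout_ms, batch_size=None):
    """Toy single-threaded loop: returns (worst_heartbeat_delay_ms, timed_out).

    batch_size=None models the unthrottled case, where all contexts are
    created back to back before the loop can serve any heartbeat.
    """
    now = 0
    next_heartbeat_due = heartbeat_interval_ms
    worst_delay = 0
    created = 0
    batch = batch_size if batch_size is not None else num_contexts

    while created < num_contexts:
        # Create up to `batch` contexts without yielding the thread.
        n = min(batch, num_contexts - created)
        now += n * create_cost_ms
        created += n

        # The thread is free again: serve every heartbeat that became due.
        while next_heartbeat_due <= now:
            worst_delay = max(worst_delay, now - next_heartbeat_due)
            next_heartbeat_due += heartbeat_interval_ms

    return worst_delay, worst_delay > heartbeat_timeout_ms


if __name__ == "__main__":
    # Hypothetical numbers: 1000 contexts at 100 ms each (slow HDFS).
    unthrottled = simulate(1000, 100, 10_000, 50_000)
    throttled = simulate(1000, 100, 10_000, 50_000, batch_size=30)
    print("unthrottled:", unthrottled)
    print("throttled:  ", throttled)
```

In this model the worst heartbeat delay is bounded by the duration of one batch, so a batch size could be chosen such that `batch_size * create_cost_ms` stays well below the heartbeat timeout.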

Thanks!

Best,
Paul Lam