(DEPRECATED) Apache Flink User Mailing List archive.

TM heartbeat timeout due to ResourceManager being busy

Posted by Paul Lam on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/TM-heartbeat-timeout-due-to-ResourceManager-being-busy-tp38626.html

Hi,

Even with FLINK-13184 implemented (even with Flink 1.11), jobs with high parallelism still occasionally hit TaskManager-ResourceManager (TM-RM) heartbeat timeouts. This happens when the RM is busy creating TM contexts during cluster initialization and HDFS is slow at that moment.

Apart from increasing the TM heartbeat timeout, is there any recommended out-of-the-box approach that reduces the chance of hitting these timeouts?
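
For reference, the workaround I mean is roughly the following in flink-conf.yaml. This is only a sketch: the values are illustrative rather than a recommendation, and as far as I know the defaults are 10000 ms for heartbeat.interval and 50000 ms for heartbeat.timeout.

```yaml
# flink-conf.yaml (illustrative values, assumed for this example only)
# Interval at which heartbeats are requested, in milliseconds.
heartbeat.interval: 10000
# Time after which a missing heartbeat marks the peer as timed out, in milliseconds.
# Raised above the default so a busy RM has more slack during cluster start-up.
heartbeat.timeout: 120000
```

A larger timeout only hides the problem, and it also delays detection of TMs that are really lost, which is why I am looking for something better.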

In the long run, would it be possible to limit the number of TaskManager contexts that the RM creates at a time, so that the heartbeat triggers get a chance to run in between?

Thanks!

Best,
Paul Lam