TM heartbeat timeout due to ResourceManager being busy
Posted by Paul Lam
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/TM-heartbeat-timeout-due-to-ResourceManager-being-busy-tp38626.html
Hi,
Even with FLINK-13184 implemented (i.e. even on Flink 1.11), jobs with high parallelism
still occasionally hit TM-RM heartbeat timeouts when the RM is busy creating TM contexts
during cluster initialization and HDFS happens to be slow at that moment.
Apart from increasing the TM heartbeat timeout, is there a recommended out-of-the-box
approach that reduces the chance of hitting these timeouts?
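
For reference, the workaround I mean is raising the heartbeat timeout in flink-conf.yaml. The snippet below is only a sketch: `heartbeat.interval` and `heartbeat.timeout` are the standard Flink options (defaults 10000 ms and 50000 ms), but the 120000 value is an arbitrary illustration, not a tuned recommendation.

```yaml
# flink-conf.yaml (illustrative values only)
heartbeat.interval: 10000    # ms between heartbeat requests (Flink default)
heartbeat.timeout: 120000    # ms before a heartbeat target is considered lost (default: 50000)
```

The downside is that genuinely lost TMs are also detected later, which is why I'd like to avoid relying on this alone.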
In the long run, would it be possible to limit the number of TaskManager contexts the RM
creates at a time, so that heartbeat handling gets a chance to run in between?
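
To make the idea concrete, here is a rough, self-contained Java sketch of the kind of throttling I have in mind. It assumes context creation can run on a separate I/O pool, and it caps the number of in-flight creations with a semaphore. Everything in it (`ThrottledContextCreation`, `runDemo`, the sleep standing in for slow HDFS) is hypothetical and not taken from Flink's code; how this would map onto the RM's actual threading model is an open question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these names are Flink classes. */
public class ThrottledContextCreation {

    /**
     * Creates `total` fake TM contexts with at most `maxInFlight` (>= 1)
     * running at once, and returns the highest concurrency actually observed.
     */
    public static int runDemo(int total, int maxInFlight) throws Exception {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxObserved = new AtomicInteger();
        // Pool is deliberately larger than the limit, so the semaphore
        // (not the pool size) is what throttles the work.
        ExecutorService ioPool = Executors.newFixedThreadPool(Math.max(1, maxInFlight * 2));
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < total; i++) {
                pending.add(ioPool.submit(() -> {
                    permits.acquireUninterruptibly();
                    try {
                        int now = inFlight.incrementAndGet();
                        maxObserved.accumulateAndGet(now, Math::max);
                        Thread.sleep(5); // stand-in for slow HDFS access
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                }));
            }
            // Wait for every creation to finish before reporting.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            ioPool.shutdown();
        }
        return maxObserved.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("max concurrent creations: " + runDemo(20, 4));
    }
}
```

The point of the cap is that a burst of container requests can no longer saturate the process with HDFS-bound work all at once; whether a fixed limit or something adaptive is the better choice is something I don't know.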
Thanks!