The heartbeat of JobManager timed out

Alexey Trenikhun
Hello,

I periodically see the following in the JM log (Flink 1.12.2):

{"ts":"2021-05-15T21:10:36.325Z","message":"The heartbeat of JobManager with id be8225ebae1d6422b7f268c801044b05 timed out.","logger_name":"org.apache.flink.runtime.resourcemanager.StandaloneResourceManager","thread_name":"flink-akka.actor.default-dispatcher-5","level":"INFO","level_value":20000}

How can I diagnose or troubleshoot this problem? Why would the JobManager, which is co-located with the resource manager, time out? I assume a network issue is unlikely.

Thanks,
Alexey

Re: The heartbeat of JobManager timed out

Smile
Hi Alexey,

We have the same problem running Flink 1.9.0 on YARN.
JM log shows this:


We are also looking for a way to troubleshoot this problem.

Best regards.
Smile

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: The heartbeat of JobManager timed out

Smile
In reply to this post by Alexey Trenikhun
JM log shows this:

INFO  org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 41e3ef1f248d24ddefdccd1887947106 timed out.

Re: The heartbeat of JobManager timed out

Xintong Song
Hi Alexey & Smile,

JM & RM are located in the same process, so a network issue is unlikely. Such timeouts are usually caused by one of the two endpoints not responding in time.

Some common causes:
- The process is under severe GC pressure. You can check the GC logs for long or frequent pauses around the time of the timeout.
- Insufficient CPU resources. You can check the CPU load of the physical machine (standalone) or of the pod/container (K8s/Yarn).
- Busy RPC main thread. Even with sufficient CPU resources (multiple cores), processing capacity can be limited by the single-threaded RPC main threads. This is usually observed for large-scale jobs (in terms of number of vertices and parallelism). In that case, the heartbeat timeout has to be increased.
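
As a rough sketch of the first and last points, the fragment below shows where GC logging and the heartbeat timeout could be set in flink-conf.yaml. The timeout value, the GC log path, and the Java 8 style GC flags are illustrative assumptions, not recommendations; please adapt them to your JVM version and deployment.

```yaml
# flink-conf.yaml (sketch; values are assumptions for illustration)

# Interval between heartbeat requests, in ms (10000 is the Flink default).
heartbeat.interval: 10000

# Time after which a missing heartbeat is reported as timed out, in ms.
# The Flink default is 50000; 120000 here is an arbitrary example value.
heartbeat.timeout: 120000

# Enable GC logging for the JobManager process.
# Flags assume a Java 8 JVM; the log path is hypothetical.
env.java.opts.jobmanager: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/jobmanager-gc.log"
```

Raising the timeout only hides the symptom if the real cause is GC or CPU starvation, so it is probably worth checking the GC log against the timestamps of the timeout messages first.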

Thank you~

Xintong Song



On Mon, May 17, 2021 at 11:12 AM Smile <[hidden email]> wrote:
JM log shows this:

INFO  org.apache.flink.yarn.YarnResourceManager                     - The
heartbeat of JobManager with id 41e3ef1f248d24ddefdccd1887947106 timed out.