(DEPRECATED) Apache Flink User Mailing List archive.

The heartbeat of JobManager timed out

Alexey Trenikhun

The heartbeat of JobManager timed out

Hello,

I periodically see the following in the JM log (Flink 1.12.2):

{"ts":"2021-05-15T21:10:36.325Z","message":"The heartbeat of JobManager with id be8225ebae1d6422b7f268c801044b05 timed out.","logger_name":"org.apache.flink.runtime.resourcemanager.StandaloneResourceManager","thread_name":"flink-akka.actor.default-dispatcher-5","level":"INFO","level_value":20000}

How can I diagnose or troubleshoot this problem? Why would the JobManager, which is co-located with the ResourceManager, time out? I assume a network issue is unlikely.

Thanks,
Alexey

Smile

Re: The heartbeat of JobManager timed out

Hi Alexey,

We have the same problem running on YARN with Flink 1.9.0.
JM log shows this:

We are also looking for a way to troubleshoot this problem.

Best regards.
Smile

Alexey Trenikhun wrote

> Hello,
>
> I periodically see in JM log (Flink 1.12.2):
>
> {"ts":"2021-05-15T21:10:36.325Z","message":"The heartbeat of JobManager
> with id be8225ebae1d6422b7f268c801044b05 timed
> out.","logger_name":"org.apache.flink.runtime.resourcemanager.StandaloneResourceManager","thread_name":"flink-akka.actor.default-dispatcher-5","level":"INFO","level_value":20000}
>
> How to diagnose/troubleshoot this problem? Why could JobManager, which is
> co-located with resource manager timeout, I assume this is unlikely
> network issue?
>
> Thanks,
> Alexey

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Smile

Re: The heartbeat of JobManager timed out

In reply to this post by Alexey Trenikhun

JM log shows this:

INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 41e3ef1f248d24ddefdccd1887947106 timed out.


Xintong Song

Re: The heartbeat of JobManager timed out

Hi Alexey & Smile,

The JM and RM run in the same process, so a network issue is unlikely. Such timeouts are usually caused by one of the two endpoints not responding in time.

Some common causes:

- The process is under severe GC pressure. You can check the GC logs for long or frequent pauses around the time of the timeout.

- Insufficient CPU resources. You can check the CPU load of the physical machine (standalone) or of the pod/container (Kubernetes/YARN).

- Busy RPC main thread. Even with sufficient CPU resources (multiple cores), processing capacity can be limited by the single-threaded RPC main thread. This is usually observed for large-scale jobs (in terms of number of vertices and parallelism). In that case, you would have to increase the heartbeat timeout.
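
As a rough sketch of how the GC check and the timeout increase above could look in flink-conf.yaml: the option names heartbeat.interval, heartbeat.timeout and env.java.opts.jobmanager are existing Flink options, but the timeout value and the GC log path below are only illustrative assumptions, not recommendations from this thread. The GC flags shown are the Java 8 style; on Java 9 or later, -Xlog:gc* would be the equivalent.

```yaml
# flink-conf.yaml (sketch; values are assumptions for illustration)

# Heartbeat settings in milliseconds.
# Flink's documented defaults are interval 10000 and timeout 50000.
heartbeat.interval: 10000
heartbeat.timeout: 120000   # assumed value; raise only if the RPC main thread is the bottleneck

# Enable GC logging for the JobManager process so that GC pauses can be
# compared with the timestamps of the heartbeat timeout messages.
# The log path is hypothetical; pick one that exists on your hosts.
env.java.opts.jobmanager: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/jobmanager-gc.log"
```

Raising the timeout only hides the symptom if the real cause is GC pressure or CPU starvation, so it is probably worth checking the GC log and CPU load first.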

Thank you~

Xintong Song

On Mon, May 17, 2021 at 11:12 AM Smile <[hidden email]> wrote:

> JM log shows this:
>
> INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 41e3ef1f248d24ddefdccd1887947106 timed out.
