TaskManager crash. Zookeeper timeout

Colletta, Edward

Using Flink 1.11.2 on Java 11, session cluster with 16 jobs running on AWS ECS instances.  The cluster has 3 JMs and 3 TMs; a separate ZooKeeper cluster has 3 nodes.

 

One of our TaskManagers crashed today, and the root cause appears to be a ZooKeeper timeout.  We are wondering whether some setting might be causing this timeout and how it could be tuned.  Any help will be greatly appreciated.
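
For anyone reasoning about the timeout values involved, here is a rough sketch of how a ZooKeeper session timeout is usually derived. It assumes the server's default bounds (2 x tickTime to 20 x tickTime) and the Java client's rule of reporting silence after about two thirds of the negotiated session timeout. The requested value of 60000 ms (Flink's default for `high-availability.zookeeper.client.session-timeout`) and the tickTime of 2000 ms are assumptions, not values confirmed for this cluster.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Clamp the requested session timeout the way a ZooKeeper server does.

    By default the server bounds are 2 * tickTime and 20 * tickTime unless
    minSessionTimeout / maxSessionTimeout are set explicitly.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


def client_read_timeout(session_timeout_ms):
    """Silence (in ms) after which the Java client logs 'have not heard from server'."""
    return session_timeout_ms * 2 // 3


# Assumed values: Flink's default requested timeout and a tickTime of 2000 ms.
requested_ms = 60_000
tick_time_ms = 2_000

session_ms = negotiated_session_timeout(requested_ms, tick_time_ms)
print("negotiated session timeout (ms):", session_ms)
print("client read timeout (ms):", client_read_timeout(session_ms))
```

Under these assumptions, raising the timeout on the Flink side alone would have no effect beyond the server's upper bound, so both sides would need to be checked.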

 

The first sign of trouble in the log is the following:

 

2021-01-27 11:16:39,795 WARN  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client session timed out, have not heard from server in 34951ms for sessionid 0x1400000c01570036

2021-01-27 11:16:39,795 INFO  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client session timed out, have not heard from server in 34951ms for sessionid 0x1400000c01570036, closing socket connection and attempting reconnect

2021-01-27 11:16:39,897 INFO  org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager [] - State change: SUSPENDED

2021-01-27 11:16:39,969 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 7613291aea3f4892a0deed0e7036e229 with leader id 8959b1fb00fdd4e3d28daade48204e1f lost leadership.

2021-01-27 11:16:39,969 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 3230dacf7fa0b8b8f9fe1c77ebdde2bb with leader id bccda87aa8ab14f23e98a4b6d2bf4081 lost leadership.

2021-01-27 11:16:39,969 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 8f2ee940006ebb6d8f6d12e3db917da3 with leader id b72d64c2ec112d96cc3b93697d85478d lost leadership.

2021-01-27 11:16:39,969 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job aaec26e3924e81c12bd5a6d71f6c0d77 with leader id 8d91fefd14539d11d60a16e0e5cd45b1 lost leadership.

2021-01-27 11:16:39,969 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 2d5f912867ff70a58638aff51c7f6f33 with leader id b24724d3e03bee3486fdc5dc616b4a9c lost leadership.

2021-01-27 11:16:39,969 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 29eb631a7a07aa6b2c0224972b9937bb with leader id 8479de79b7eda73fca6593da93c04027 lost leadership.

2021-01-27 11:16:39,970 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,970 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job bc7688332e73f330f08c95428630b99e with leader id a541d5eb3b60d29afc3a16cab2f742e7 lost leadership.

2021-01-27 11:16:39,970 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,970 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,970 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job a70b0023b705c39fa66f47f1a666b65d with leader id a0bfc94c9ff40689a7143396cafe4ac7 lost leadership.

2021-01-27 11:16:39,970 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,970 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 4c929f573971b8520a76ee1dfe5c3e35 with leader id 922675f382f87225300696bae21841cc lost leadership.

2021-01-27 11:16:39,970 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,970 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job a6eb4833baac19216d7ffd189ec7be4d with leader id 920ff4d6f778fcc5c0ad41e352914f46 lost leadership.

2021-01-27 11:16:39,970 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2021-01-27 11:16:39,970 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job fcb8e204e9efb85c5af46cfdeb29c743 with leader id 826bb52be9c8e80059eaf5f78c614252 lost leadership.

2021-01-27 11:16:40,723 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close JobManager connection for job 7613291aea3f4892a0deed0e7036e229.

2021-01-27 11:16:40,724 INFO  org.apache.flink.runtime.taskmanager.Task                    [] - Attempting to fail task externally EnrichTradeWithBlockSize -> LessThanBlockSize (4/4) (628f0445570d0df74ce62c2d0fb9b5c1).

2021-01-27 11:16:40,724 WARN  org.apache.flink.runtime.taskmanager.Task                    [] - EnrichTradeWithBlockSize -> LessThanBlockSize (4/4) (628f0445570d0df74ce62c2d0fb9b5c1) switched from RUNNING to FAILED.

org.apache.flink.util.FlinkException: JobManager responsible for 7613291aea3f4892a0deed0e7036e229 lost the leadership.

        at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1415) ~[flink-dist_2.12-1.11.2.jar:1.11.2]

        at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:173) ~[flink-dist_2.12-1.11.2.jar:1.11.2]

        at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1852) ~[flink-dist_2.12-1.11.2.jar:1.11.2]

        at java.util.Optional.ifPresent(Optional.java:183) ~[?:?]

        at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1851) ~[flink-dist_2.12-1.11.2.jar:1.11.2]

        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.12-1.11.2.jar:1.11.2]

        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.12-1.11.2.jar:1.11.2]

        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.actor.Actor.aroundReceive(Actor.scala:517) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.actor.Actor.aroundReceive$(Actor.scala:515) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.12-1.11.2.jar:1.11.2]

        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.12-1.11.2.jar:1.11.2]
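
To make a log like the one above easier to read, a small script can pull out how long the client was silent and which jobs lost their leader. This is only a sketch: the function name `summarize` and the `SAMPLE` list are made up for illustration, and the regular expressions are based solely on the message formats visible in this log.

```python
import re

# Patterns taken from the message formats seen in the TaskManager log above.
TIMEOUT_RE = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-f]+)"
)
LOST_RE = re.compile(
    r"JobManager for job ([0-9a-f]{32}) with leader id ([0-9a-f]{32}) lost leadership"
)


def summarize(lines):
    """Return the first reported silence, its session id, and the affected job ids."""
    silence_ms = None
    session_id = None
    jobs = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout and silence_ms is None:
            silence_ms = int(timeout.group(1))
            session_id = timeout.group(2)
        lost = LOST_RE.search(line)
        if lost and lost.group(1) not in jobs:
            jobs.append(lost.group(1))
    return {"silence_ms": silence_ms, "session_id": session_id, "jobs": jobs}


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    "2021-01-27 11:16:39,795 WARN  org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client session timed out, have not heard from server in 34951ms for sessionid 0x1400000c01570036",
    "2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 7613291aea3f4892a0deed0e7036e229 with leader id 8959b1fb00fdd4e3d28daade48204e1f lost leadership.",
    "2021-01-27 11:16:39,969 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager for job 3230dacf7fa0b8b8f9fe1c77ebdde2bb with leader id bccda87aa8ab14f23e98a4b6d2bf4081 lost leadership.",
]

summary = summarize(SAMPLE)
print(summary["silence_ms"], summary["session_id"], len(summary["jobs"]))
```

Running it over the full TaskManager log instead of `SAMPLE` (for example `summarize(open(path))`, with `path` being wherever the log is stored) would list every job that was failed at the moment the ZooKeeper connection was suspended.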