We have observed that sometimes a job gets stuck in SUSPENDED state, and no job restart/recovery is attempted once the job is suspended.

* it is a high-parallelism job (close to 2,000)
* there were a few job restarts before this
* there were high GC pauses during the period
* there was a ZooKeeper timeout, probably caused by the high GC pauses

Is it related to https://issues.apache.org/jira/browse/FLINK-11537? I pasted some logs at the end.

Thanks,
Steven

2019-02-28 19:04:36,357 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-02-28 19:04:36,357 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 100.82.141.106/100.82.141.106:2181
2019-02-28 19:04:36,357 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2019-02-28 19:04:36,357 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 100.82.141.106/100.82.141.106:2181, initiating session
2019-02-28 19:04:36,359 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2, negotiated timeout = 40000
2019-02-28 19:04:36,359 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2019-02-28 19:04:36,360 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
...
2019-02-28 19:05:09,400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-02-28 19:05:09,400 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state RESTARTING to SUSPENDING.
org.apache.flink.util.FlinkException: JobManager is no longer the leader.
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state SUSPENDING to SUSPENDED.
2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Suspending
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
...
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is no longer the leader..
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
2019-02-28 19:05:09,449 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
2019-02-28 19:05:09,449 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
2019-02-28 19:05:09,452 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
Hi Steven,

is this the tail of the logs or are there other statements following? I think your problem could indeed be related to FLINK-11537. Is it possible to somehow reliably reproduce this problem? If yes, then you could try out the RC for Flink 1.8.0, which should be published in the next few days.

Cheers,
Till

On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <[hidden email]> wrote:
Till, I will send you the complete log offline. We don't know how to reliably reproduce the problem, but it did happen quite frequently, about once every couple of days. Let me see if I can cherry-pick the fix/commit to the 1.7 branch.

Thanks,
Steven

On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <[hidden email]> wrote:
Hi Steven,

a quick update from my side after looking through the logs. The problem seems to be that the Dispatcher does not start recovering the jobs after regaining the leadership it had lost before. I cannot yet tell why this is happening, and I will try to debug the problem further. If you manage to reproduce the problem, could you run the cluster with DEBUG log level and send me the logs again? If I don't manage to figure out how this problem happens before then, that would be super helpful.

Cheers,
Till

On Mon, Mar 4, 2019 at 7:44 PM Steven Wu <[hidden email]> wrote:
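For reference, enabling DEBUG logging on a Flink cluster of that era typically means raising the root logger level in conf/log4j.properties and restarting the processes so the JobManager and TaskManagers pick up the new level. The appender name below matches Flink's default configuration of the time, but verify it against your own file:

log4j.rootLogger=DEBUG, file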
Hi Steven,

I think I found the problem. It is caused by a JobMaster which takes a long time to suspend the job, combined with multiple leader changes. What happens after the first leadership revoking and regaining is that the Dispatcher recovers the submitted job but waits to execute it, because the JobMaster from the previous leader session has not been stopped yet. This waiting is done by scheduling futures with the current main thread executor. What happens next is another cycle of leadership revoking and regaining (the 3rd time that the Dispatcher receives the leadership). Due to this, we create a new main thread executor and invalidate the previous one. As a consequence, the future created during the 2nd leadership will never be executed. This alone would not be a problem, but since we serialize the leadership-gaining operations, we actually wait for this future to complete before starting the job recovery belonging to the 3rd leadership.

Long story short: FLINK-11537 should make this problem very unlikely to occur because the JobMaster termination is now almost instantaneous. However, it does not fully fix the underlying problem. Here is the JIRA issue for fixing the problem: https://issues.apache.org/jira/browse/FLINK-11843

Cheers,
Till

On Wed, Mar 6, 2019 at 1:21 PM Till Rohrmann <[hidden email]> wrote:
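To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern described above. It is not Flink's actual Dispatcher code: the class name, the grantLeadership method, and the condensation of the three-leadership scenario into two grants are all invented for illustration. The sketch shows how recovery chained behind a "previous session terminated" future stalls forever once the completion of that future was queued on an executor that a later leadership grant invalidates.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LeadershipChainSketch {

    // Future that the next leadership session waits on before recovering jobs.
    private CompletableFuture<Void> previousSessionTerminated =
            CompletableFuture.completedFuture(null);

    // "Main thread executor" of the current session; replaced on every grant.
    private ExecutorService mainThreadExecutor;

    synchronized void grantLeadership(int session) {
        // A new grant invalidates the previous session's executor. Anything
        // still queued on it, including the completion of that session's
        // termination future, is silently dropped.
        if (mainThreadExecutor != null) {
            mainThreadExecutor.shutdownNow();
        }
        mainThreadExecutor = Executors.newSingleThreadExecutor();

        // Serialize leadership sessions: job recovery may only start once the
        // previous session's JobMaster has terminated. If that termination
        // future is never completed, this stage never runs.
        previousSessionTerminated.thenRun(() ->
                System.out.println("session " + session + ": recovering jobs"));

        // Simulate a JobMaster that is slow to suspend; the completion of its
        // termination future is queued on this session's executor.
        CompletableFuture<Void> terminated = new CompletableFuture<>();
        mainThreadExecutor.execute(() -> {
            try {
                Thread.sleep(1_000); // slow job suspension
                terminated.complete(null);
            } catch (InterruptedException e) {
                // Executor was shut down by the next grant: the future is
                // abandoned and stays incomplete forever.
                Thread.currentThread().interrupt();
            }
        });
        previousSessionTerminated = terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        LeadershipChainSketch dispatcher = new LeadershipChainSketch();
        dispatcher.grantLeadership(1); // prints "session 1: recovering jobs"
        Thread.sleep(100);
        dispatcher.grantLeadership(2); // session 1's termination is dropped,
                                       // so session 2 never recovers
        Thread.sleep(3_000);           // nothing happens: the chain is stalled
        dispatcher.mainThreadExecutor.shutdownNow();
    }
}

Running the sketch prints "session 1: recovering jobs" and then nothing: session 2 waits on a future whose completing task died with the invalidated executor. That is the same shape as the hang described above, with one fewer leadership change than in the real Dispatcher, where FLINK-11843 tracks the proper fix.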