We have observed that sometimes a job gets stuck in SUSPENDED state, and no job restart/recovery is attempted once the job is suspended.

* it is a high-parallelism job (close to 2,000)
* there were a few job restarts before this
* there were high GC pauses during the period
* there was a ZooKeeper timeout, probably caused by the high GC pauses

Is it related to https://issues.apache.org/jira/browse/FLINK-11537? I pasted some logs at the end.

Thanks,
Steven

2019-02-28 19:04:36,357 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-6664341082794720643.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-02-28 19:04:36,357 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 100.82.141.106/100.82.141.106:2181
2019-02-28 19:04:36,357 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
2019-02-28 19:04:36,357 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 100.82.141.106/100.82.141.106:2181, initiating session
2019-02-28 19:04:36,359 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 100.82.141.106/100.82.141.106:2181, sessionid = 0x365ef9c4fe7f1f2, negotiated timeout = 40000
2019-02-28 19:04:36,359 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2019-02-28 19:04:36,359 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection RECONNECTED. Changes to the submitted job graphs are monitored again.
2019-02-28 19:04:36,360 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
...
2019-02-28 19:05:09,400 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2019-02-28 19:05:09,400 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state RESTARTING to SUSPENDING.
org.apache.flink.util.FlinkException: JobManager is no longer the leader.
    at org.apache.flink.runtime.jobmaster.JobManagerRunner.revokeLeadership(JobManagerRunner.java:371)
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.notLeader(ZooKeeperLeaderElectionService.java:247)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:640)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$8.apply(LeaderLatch.java:636)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:635)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.handleStateChange(LeaderLatch.java:623)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch.access$000(LeaderLatch.java:64)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.leader.LeaderLatch$1.stateChanged(LeaderLatch.java:82)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:259)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$2.apply(ConnectionStateManager.java:255)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.processEvents(ConnectionStateManager.java:253)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager.access$000(ConnectionStateManager.java:43)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager$1.call(ConnectionStateManager.java:111)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job cybertron-flink (0e594c065c7f8319a12fa47e089ca9b0) switched from state SUSPENDING to SUSPENDED.
2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 0e594c065c7f8319a12fa47e089ca9b0.
2019-02-28 19:05:09,403 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Suspending
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job 0e594c065c7f8319a12fa47e089ca9b0 has been suspended.
...
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.JobMaster - Close ResourceManager connection 9db2027a0a32f2a44744a0d4a0f84b87: JobManager is no longer the leader..
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Suspending SlotPool.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - The rpc endpoint org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage until processing is started.
2019-02-28 19:05:09,448 INFO org.apache.flink.runtime.jobmaster.JobMaster - Stopping the JobMaster for job cybertron-flink(0e594c065c7f8319a12fa47e089ca9b0).
2019-02-28 19:05:09,449 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Stopping SlotPool.
2019-02-28 19:05:09,449 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Stopping ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/0e594c065c7f8319a12fa47e089ca9b0/job_manager_lock'}.
2019-02-28 19:05:09,452 INFO org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Released locks of job graph 0e594c065c7f8319a12fa47e089ca9b0 from ZooKeeper.
Hi Steven,

is this the tail of the logs or are there other statements following? I think your problem could indeed be related to FLINK-11537. Is it possible to somehow reliably reproduce this problem? If yes, then you could try out the RC for Flink 1.8.0, which should be published in the next few days.

Cheers,
Till

On Sat, Mar 2, 2019 at 12:47 AM Steven Wu <[hidden email]> wrote:
Till, I will send you the complete log offline. We don't know how to reliably reproduce the problem, but it did happen quite frequently, about once every couple of days. Let me see if I can cherry-pick the fix/commit to the 1.7 branch.

Thanks,
Steven

On Mon, Mar 4, 2019 at 5:55 AM Till Rohrmann <[hidden email]> wrote:
Hi Steven,

a quick update from my side after looking through the logs. The problem seems to be that the Dispatcher does not start recovering the jobs after regaining the leadership it had lost before. I cannot yet tell why this is happening, and I will try to debug the problem further. If you manage to reproduce the problem, could you run the cluster with DEBUG log level and send me the logs again? If I don't manage to figure out how this problem happens before then, that would be super helpful.

Cheers,
Till

On Mon, Mar 4, 2019 at 7:44 PM Steven Wu <[hidden email]> wrote:
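For reference, enabling DEBUG logging on a Flink cluster of that era typically means raising the root logger level in conf/log4j.properties and restarting the processes so the JobManager and TaskManagers pick up the new level. The appender name below matches Flink's default configuration of the time, but verify it against your own file:

log4j.rootLogger=DEBUG, file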
Hi Steven,

I think I found the problem. It is caused by a JobMaster which takes a long time to suspend the job, combined with multiple leader changes. What happens after the first leadership revoking and regaining is that the Dispatcher recovers the submitted job but waits to execute it, because the JobMaster from the previous leader session has not been stopped yet. This waiting is done by scheduling futures with the current main thread executor. What happens next is another cycle of leadership revoking and regaining (the 3rd time that the Dispatcher receives the leadership). Due to this, we create a new main thread executor and invalidate the previous one. As a consequence, the future created during the 2nd leadership will never be executed. This alone would not be a problem, but since we serialize the leadership-gaining operations, we actually wait for this future to complete before starting the job recovery belonging to the 3rd leadership.

Long story short: FLINK-11537 should make this problem very unlikely to occur because the JobMaster termination is now almost instantaneous. However, it does not fully fix the underlying problem. Here is the JIRA issue for fixing the problem: https://issues.apache.org/jira/browse/FLINK-11843

Cheers,
Till

On Wed, Mar 6, 2019 at 1:21 PM Till Rohrmann <[hidden email]> wrote:
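To make the failure mode concrete, here is a minimal, self-contained Java sketch of the pattern described above. It is not Flink's actual Dispatcher code: the class name, the grantLeadership method, and the condensation of the three-leadership scenario into two grants are all invented for illustration. The sketch shows how recovery chained behind a "previous session terminated" future stalls forever once the completion of that future was queued on an executor that a later leadership grant invalidates.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LeadershipChainSketch {

    // Future that the next leadership session waits on before recovering jobs.
    private CompletableFuture<Void> previousSessionTerminated =
            CompletableFuture.completedFuture(null);

    // "Main thread executor" of the current session; replaced on every grant.
    private ExecutorService mainThreadExecutor;

    synchronized void grantLeadership(int session) {
        // A new grant invalidates the previous session's executor. Anything
        // still queued on it, including the completion of that session's
        // termination future, is silently dropped.
        if (mainThreadExecutor != null) {
            mainThreadExecutor.shutdownNow();
        }
        mainThreadExecutor = Executors.newSingleThreadExecutor();

        // Serialize leadership sessions: job recovery may only start once the
        // previous session's JobMaster has terminated. If that termination
        // future is never completed, this stage never runs.
        previousSessionTerminated.thenRun(() ->
                System.out.println("session " + session + ": recovering jobs"));

        // Simulate a JobMaster that is slow to suspend; the completion of its
        // termination future is queued on this session's executor.
        CompletableFuture<Void> terminated = new CompletableFuture<>();
        mainThreadExecutor.execute(() -> {
            try {
                Thread.sleep(1_000); // slow job suspension
                terminated.complete(null);
            } catch (InterruptedException e) {
                // Executor was shut down by the next grant: the future is
                // abandoned and stays incomplete forever.
                Thread.currentThread().interrupt();
            }
        });
        previousSessionTerminated = terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        LeadershipChainSketch dispatcher = new LeadershipChainSketch();
        dispatcher.grantLeadership(1); // prints "session 1: recovering jobs"
        Thread.sleep(100);
        dispatcher.grantLeadership(2); // session 1's termination is dropped,
                                       // so session 2 never recovers
        Thread.sleep(3_000);           // nothing happens: the chain is stalled
        dispatcher.mainThreadExecutor.shutdownNow();
    }
}

Running the sketch prints "session 1: recovering jobs" and then nothing: session 2 waits on a future whose completing task died with the invalidated executor. That is the same shape as the hang described above, with one fewer leadership change than in the real Dispatcher, where FLINK-11843 tracks the proper fix.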