Old JobManager lost its leadership in zk

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Old JobManager lost its leadership in zk

Steven Wu
Hi, 

After this error/exception, it seems that taskmanager never connects to jobmanager anymore.  Job stuck in failed state because there is not enough slots to recover the job. 

let's assume there was a temp glitch btw jobmanager and zk. would it cause such a permanent failure in Flink?

I checked the zookeeper record. 
* leader zknode seems to have the correct info for "job_manager_lock"
* I am not sure how to read the leaderlatch zknode


A little more about the job
* standalone cluster mode
* 1 jobmanager
* 1 taskmanager 

Thanks,
Steven

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager: Old JobManager lost its leadership.

       at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073)

       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467)

       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:11:48,011 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,013 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache



Reply | Threaded
Open this post in threaded view
|

Re: Old JobManager lost its leadership in zk

Steven Wu
from Flink UI on jobmanager, sometimes I saw taskmanager connected and heartbeat time got updated.


but then sometimes the taskmanager page become blank. maybe disconnected.



On Wed, Apr 11, 2018 at 1:31 PM, Steven Wu <[hidden email]> wrote:
Hi, 

After this error/exception, it seems that taskmanager never connects to jobmanager anymore.  Job stuck in failed state because there is not enough slots to recover the job. 

let's assume there was a temp glitch btw jobmanager and zk. would it cause such a permanent failure in Flink?

I checked the zookeeper record. 
* leader zknode seems to have the correct info for "job_manager_lock"
* I am not sure how to read the leaderlatch zknode


A little more about the job
* standalone cluster mode
* 1 jobmanager
* 1 taskmanager 

Thanks,
Steven

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager: Old JobManager lost its leadership.

       at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073)

       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467)

       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:11:48,011 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,013 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache




Reply | Threaded
Open this post in threaded view
|

Re: Old JobManager lost its leadership in zk

Steven Wu
Here are the error logs from jobmanager side

2018-04-11 01:14:16,280 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.65.108.246:42787/user/jobmanager no longer participates in the leader election.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2018-04-11 01:14:16,281 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job ksrouter (f303cea2286a620503a9101585cc11bd) switched from state RUNNING to SUSPENDED.

java.lang.Exception: JobManager is no longer the leader.

       at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:284)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:14:16,281 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager was lost/killed: 8dc2fd35d4d13ff0d291ae12d5e2b6f6 @ 95656bc5-7e75-49ca-9b48-0c422f6eaa12 (dataPort=44671)

       at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)

       at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)

       at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)

       at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)

       at org.apache.flink.runtime.instance.InstanceManager.unregisterAllTaskManagers(InstanceManager.java:234)

       at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:295)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


2018-04-11 01:14:24,652 ERROR akka.remote.Remoting                                          - Association to [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] with UID [1239336884] irrecoverably f

ailed. Quarantining address.

java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [0 {0}] ack: ACK[1, {}]

       at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:299)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was 0 but cumulative ACK is 1

       at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103)

       at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:295)

       ... 11 more




On Wed, Apr 11, 2018 at 1:39 PM, Steven Wu <[hidden email]> wrote:
from Flink UI on jobmanager, sometimes I saw taskmanager connected and heartbeat time got updated.


but then sometimes the taskmanager page become blank. maybe disconnected.



On Wed, Apr 11, 2018 at 1:31 PM, Steven Wu <[hidden email]> wrote:
Hi, 

After this error/exception, it seems that taskmanager never connects to jobmanager anymore.  Job stuck in failed state because there is not enough slots to recover the job. 

let's assume there was a temp glitch btw jobmanager and zk. would it cause such a permanent failure in Flink?

I checked the zookeeper record. 
* leader zknode seems to have the correct info for "job_manager_lock"
* I am not sure how to read the leaderlatch zknode


A little more about the job
* standalone cluster mode
* 1 jobmanager
* 1 taskmanager 

Thanks,
Steven

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager: Old JobManager lost its leadership.

       at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073)

       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467)

       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:11:48,011 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,013 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache





Reply | Threaded
Open this post in threaded view
|

Re: Old JobManager lost its leadership in zk

Steven Wu

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.65.108.246:42787/user/jobmanager no longer participates in the leader election.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.


the initial zk disconnect issue is caused by high GC pause on jobmanager. but I do have a question for the above error logs. does that mean jobmanager will never participate in leader election again. Note that our setup only has 1 jobmanager but we will use zk. is that a bad setup?


On Wed, Apr 11, 2018 at 2:00 PM, Steven Wu <[hidden email]> wrote:
Here are the error logs from jobmanager side

2018-04-11 01:14:16,280 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.65.108.246:42787/user/jobmanager no longer participates in the leader election.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2018-04-11 01:14:16,281 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job ksrouter (f303cea2286a620503a9101585cc11bd) switched from state RUNNING to SUSPENDED.

java.lang.Exception: JobManager is no longer the leader.

       at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:284)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:14:16,281 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager was lost/killed: 8dc2fd35d4d13ff0d291ae12d5e2b6f6 @ 95656bc5-7e75-49ca-9b48-0c422f6eaa12 (dataPort=44671)

       at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)

       at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)

       at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)

       at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)

       at org.apache.flink.runtime.instance.InstanceManager.unregisterAllTaskManagers(InstanceManager.java:234)

       at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:295)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


2018-04-11 01:14:24,652 ERROR akka.remote.Remoting                                          - Association to [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] with UID [1239336884] irrecoverably f

ailed. Quarantining address.

java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [0 {0}] ack: ACK[1, {}]

       at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:299)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was 0 but cumulative ACK is 1

       at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103)

       at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:295)

       ... 11 more




On Wed, Apr 11, 2018 at 1:39 PM, Steven Wu <[hidden email]> wrote:
from Flink UI on jobmanager, sometimes I saw taskmanager connected and heartbeat time got updated.


but then sometimes the taskmanager page become blank. maybe disconnected.



On Wed, Apr 11, 2018 at 1:31 PM, Steven Wu <[hidden email]> wrote:
Hi, 

After this error/exception, it seems that taskmanager never connects to jobmanager anymore.  Job stuck in failed state because there is not enough slots to recover the job. 

let's assume there was a temp glitch btw jobmanager and zk. would it cause such a permanent failure in Flink?

I checked the zookeeper record. 
* leader zknode seems to have the correct info for "job_manager_lock"
* I am not sure how to read the leaderlatch zknode


A little more about the job
* standalone cluster mode
* 1 jobmanager
* 1 taskmanager 

Thanks,
Steven

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager: Old JobManager lost its leadership.

       at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073)

       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467)

       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:11:48,011 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,013 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache






Reply | Threaded
Open this post in threaded view
|

Re: Old JobManager lost its leadership in zk

Steven Wu
never mind. it might be caused by our underline container platform change. hostname seems not resolvable for akka. we can close this thread for now.

2018-04-12 04:42:12,697 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] has failed, address i

s now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651]] Caused by: [95656bc5-7e75-49ca-9b48-0c422f6eaa12: Name or service not known]

2018-04-12 04:42:16,161 WARN  akka.remote.RemoteWatcher                                - Detected unreachable: [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651]

2018-04-12 04:42:16,161 WARN  akka.remote.Remoting                                - Association to [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gating instead for 5000 ms.



On Wed, Apr 11, 2018 at 2:14 PM, Steven Wu <[hidden email]> wrote:

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.65.108.246:42787/user/jobmanager no longer participates in the leader election.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.


the initial zk disconnect issue is caused by high GC pause on jobmanager. but I do have a question for the above error logs. does that mean jobmanager will never participate in leader election again. Note that our setup only has 1 jobmanager but we will use zk. is that a bad setup?


On Wed, Apr 11, 2018 at 2:00 PM, Steven Wu <[hidden email]> wrote:
Here are the error logs from jobmanager side

2018-04-11 01:14:16,280 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.65.108.246:42787/user/jobmanager no longer participates in the leader election.

2018-04-11 01:14:16,280 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.

2018-04-11 01:14:16,281 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job ksrouter (f303cea2286a620503a9101585cc11bd) switched from state RUNNING to SUSPENDED.

java.lang.Exception: JobManager is no longer the leader.

       at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:284)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:14:16,281 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager was lost/killed: 8dc2fd35d4d13ff0d291ae12d5e2b6f6 @ 95656bc5-7e75-49ca-9b48-0c422f6eaa12 (dataPort=44671)

       at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)

       at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523)

       at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)

       at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)

       at org.apache.flink.runtime.instance.InstanceManager.unregisterAllTaskManagers(InstanceManager.java:234)

       at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:295)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


2018-04-11 01:14:24,652 ERROR akka.remote.Remoting                                          - Association to [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] with UID [1239336884] irrecoverably f

ailed. Quarantining address.

java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [0 {0}] ack: ACK[1, {}]

       at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:299)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was 0 but cumulative ACK is 1

       at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103)

       at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:295)

       ... 11 more




On Wed, Apr 11, 2018 at 1:39 PM, Steven Wu <[hidden email]> wrote:
from Flink UI on jobmanager, sometimes I saw taskmanager connected and heartbeat time got updated.


but then sometimes the taskmanager page become blank. maybe disconnected.



On Wed, Apr 11, 2018 at 1:31 PM, Steven Wu <[hidden email]> wrote:
Hi, 

After this error/exception, it seems that taskmanager never connects to jobmanager anymore.  Job stuck in failed state because there is not enough slots to recover the job. 

let's assume there was a temp glitch btw jobmanager and zk. would it cause such a permanent failure in Flink?

I checked the zookeeper record. 
* leader zknode seems to have the correct info for "job_manager_lock"
* I am not sure how to read the leaderlatch zknode


A little more about the job
* standalone cluster mode
* 1 jobmanager
* 1 taskmanager 

Thanks,
Steven

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,007 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager: Old JobManager lost its leadership.

       at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073)

       at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467)

       at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)

       at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

       at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

       at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

       at akka.actor.Actor$class.aroundReceive(Actor.scala:502)

       at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121)

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)

       at akka.actor.ActorCell.invoke(ActorCell.scala:495)

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)

       at akka.dispatch.Mailbox.run(Mailbox.scala:224)

       at akka.dispatch.Mailbox.exec(Mailbox.scala:234)

       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-04-11 01:11:48,011 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc).

2018-04-11 01:11:48,013 INFO  org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting down BLOB cache