Hi,
After this error/exception, it seems that taskmanager never connects to jobmanager anymore. Job stuck in failed state because there is not enough slots to recover the job. let's assume there was a temp glitch btw jobmanager and zk. would it cause such a permanent failure in Flink? I checked the zookeeper record. * leader zknode seems to have the correct info for "job_manager_lock" * I am not sure how to read the leaderlatch zknode A little more about the job * standalone cluster mode * 1 jobmanager * 1 taskmanager Thanks, Steven 2018-04-11 01:11:48,007 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc). 2018-04-11 01:11:48,007 INFO org.apache.flink.runtime.taskmanager.Task - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED. java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:42787/user/jobmanager: Old JobManager lost its leadership. at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1073) at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1467) at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:277) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:121) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2018-04-11 01:11:48,011 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc). 2018-04-11 01:11:48,013 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache |
from Flink UI on jobmanager, sometimes I saw taskmanager connected and heartbeat time got updated. but then sometimes the taskmanager page become blank. maybe disconnected. On Wed, Apr 11, 2018 at 1:31 PM, Steven Wu <[hidden email]> wrote:
|
Here are the error logs from jobmanager side 2018-04-11 01:14:16,280 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED 2018-04-11 01:14:16,280 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 2018-04-11 01:14:16,280 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily). 2018-04-11 01:14:16,280 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@100.65.108.246:42787/user/jobmanager no longer participates in the leader election. 2018-04-11 01:14:16,280 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper. 2018-04-11 01:14:16,281 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job ksrouter (f303cea2286a620503a9101585cc11bd) switched from state RUNNING to SUSPENDED. java.lang.Exception: JobManager is no longer the leader. at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:284) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2018-04-11 01:14:16,281 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: kafkasource -> Sink: s3sink (1/1) (5a7dba2e186b9fdaebb62bdd703dc7dc) switched from RUNNING to FAILED. java.lang.Exception: TaskManager was lost/killed: 8dc2fd35d4d13ff0d291ae12d5e2b6f6 @ 95656bc5-7e75-49ca-9b48-0c422f6eaa12 (dataPort=44671) at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217) at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:523) at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192) at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167) at org.apache.flink.runtime.instance.InstanceManager.unregisterAllTaskManagers(InstanceManager.java:234) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:295) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2018-04-11 01:14:24,652 ERROR akka.remote.Remoting - Association to [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] with UID [1239336884] irrecoverably f ailed. Quarantining address. java.lang.IllegalStateException: Error encountered while processing system message acknowledgement buffer: [0 {0}] ack: ACK[1, {}] at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:299) at akka.actor.Actor$class.aroundReceive(Actor.scala:502) at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) at akka.actor.ActorCell.invoke(ActorCell.scala:495) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.lang.IllegalArgumentException: Highest SEQ so far was 0 but cumulative ACK is 1 at akka.remote.AckedSendBuffer.acknowledge(AckedDelivery.scala:103) at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:295) ... 11 more On Wed, Apr 11, 2018 at 1:39 PM, Steven Wu <[hidden email]> wrote:
|
2018-04-11 01:14:16,280 WARN org.apache.flink.runtime. 2018-04-11 01:14:16,280 WARN org.apache.flink.runtime. On Wed, Apr 11, 2018 at 2:00 PM, Steven Wu <[hidden email]> wrote:
|
never mind. it might be caused by our underline container platform change. hostname seems not resolvable for akka. we can close this thread for now. 2018-04-12 04:42:12,697 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] has failed, address i s now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651]] Caused by: [95656bc5-7e75-49ca-9b48-0c422f6eaa12: Name or service not known] 2018-04-12 04:42:16,161 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] 2018-04-12 04:42:16,161 WARN akka.remote.Remoting - Association to [akka.tcp://flink@95656bc5-7e75-49ca-9b48-0c422f6eaa12:34651] with unknown UID is reported as quarantined, but address cannot be quarantined without knowing the UID, gating instead for 5000 ms. On Wed, Apr 11, 2018 at 2:14 PM, Steven Wu <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |