Fwd: HA: My job didn't restart even though the TaskManager restarted.


sunny yun

I have a 4-node Flink cluster running Flink 1.3.2 and Hadoop 2.6.5. All configurations are the same on all 4 nodes.

While running a YARN session and a Flink job, I killed node #2's (flink-02) TM (YarnTaskManager) process.

After the ResourceManager created a new TM container, the job was still in FAILED status.

But when I killed the JM (YarnApplicationMasterRunner) process, everything behaved as expected: my job restarted well.

I'm using DFS for storageDir.
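[Editor's note: for readers unfamiliar with this setup, a ZooKeeper-based HA configuration like the one described is typically driven by entries along these lines in flink-conf.yaml. This is only an illustrative sketch; the quorum hosts and the storageDir path below are placeholders, and the actual values are in the attached flink-conf.yaml. The /flink and /cluster_one paths match the ZooKeeper paths visible in the logs further down.]

```yaml
# High-availability via ZooKeeper (Flink 1.3 configuration keys)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-01:2181,zk-02:2181,zk-03:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /cluster_one
# storageDir must be a DFS path reachable from all nodes
high-availability.storageDir: hdfs:///flink/ha/
# allow YARN to restart the ApplicationMaster (JobManager)
yarn.application-attempts: 10
```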

The Flink configuration file and log files are attached. If any other information is needed, please let me know.

 

Here is the use case: -------------------------------------------------------------------------------------------------------

1. start yarn session

flink-01 ~]$ flink/bin/yarn-session.sh -n 4 -s 8 -nm flink -d

2. submit job

flink-01 ~]$ flink/bin/flink run -d -p 3 -c a.b.c.Application abc.jar

3. find TM pid and kill it

flink-02 ~]$ jps | grep YarnTaskManager

4022 YarnTaskManager

flink-02 ~]$ kill 4022

 

And the JM logs -------------------------------------------------------------------------------------------------------

2017-09-04 00:50:57,830 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-02:45104] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

2017-09-04 00:50:58,013 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Container container_1504449399430_0005_01_000004 failed. Exit status: 143

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Diagnostics for container container_1504449399430_0005_01_000004 in state COMPLETE : exitStatus=143 diagnostics=Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

Killed by external signal

 

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Total number of failed containers so far: 1

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Received new container: container_1504449399430_0005_01_000005 - Remaining pending container requests: 0

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Launching TaskManager in container ContainerInLaunch @ 1504453858014: Container: [ContainerId: container_1504449399430_0005_01_000005, NodeId: flink-02:34821, NodeHttpAddress: flink-02:8042, Resource: <memory:8192, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.0.5:34821 }, ] on host flink-02

2017-09-04 00:50:58,015 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : flink-02:34821

2017-09-04 00:50:58,015 INFO  org.apache.flink.yarn.YarnJobManager                          - Task manager akka.tcp://flink@flink-02:45104/user/taskmanager terminated.

2017-09-04 00:50:58,016 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: src -> Service: vps -> Sink: snk (3/3) (33b33a9c15eb0107b320e375374ac07f) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

           at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)

           at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)

           at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)

           at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)

           at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)

           at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)

           at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:474)

           at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

           at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:100)

           at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)

           at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:103)

           at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)

           at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38)

           at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

           at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

           at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

           at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

           at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

           at akka.actor.Actor$class.aroundReceive(Actor.scala:467)

           at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)

           at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

           at akka.actor.ActorCell.invoke(ActorCell.scala:487)

           at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

           at akka.dispatch.Mailbox.run(Mailbox.scala:220)

           at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)

           at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

           at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

           at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

           at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2017-09-04 00:50:58,018 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job vp (6e73f6babf36c9321efc0524a82dede1) switched from state RUNNING to FAILING.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

           (same stack trace as above)

2017-09-04 00:50:58,027 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: src -> Service: vps -> Sink: snk (1/3) (02232baaa7b3f88982cbf40cb5ebe489) switched from RUNNING to CANCELING.

2017-09-04 00:50:58,029 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: src -> Service: vps -> Sink: snk (2/3) (952c34a277df34aa7e3216a8b96ae2d5) switched from RUNNING to CANCELING.

2017-09-04 00:50:58,029 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1

2017-09-04 00:50:58,030 INFO  org.apache.flink.runtime.instance.InstanceManager             - Unregistered task manager flink-02/10.1.0.5. Number of registered task managers 2. Number of available slots 16.

2017-09-04 00:50:58,065 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: src -> Service: vps -> Sink: snk (1/3) (02232baaa7b3f88982cbf40cb5ebe489) switched from CANCELING to CANCELED.

2017-09-04 00:50:58,071 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: src -> Service: vps -> Sink: snk (2/3) (952c34a277df34aa7e3216a8b96ae2d5) switched from CANCELING to CANCELED.

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job vp (6e73f6babf36c9321efc0524a82dede1) if no longer possible.

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job vp (6e73f6babf36c9321efc0524a82dede1) switched from state FAILING to FAILED.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

           (same stack trace as above)

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Could not restart the job vp (6e73f6babf36c9321efc0524a82dede1) because the restart strategy prevented it.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)

           (same stack trace as above)

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Stopping checkpoint coordinator for job 6e73f6babf36c9321efc0524a82dede1

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Removing /flink/cluster_one/checkpoints/6e73f6babf36c9321efc0524a82dede1 from ZooKeeper

2017-09-04 00:50:58,102 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.

2017-09-04 00:50:58,102 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing /checkpoint-counter/6e73f6babf36c9321efc0524a82dede1 from ZooKeeper

2017-09-04 00:50:58,118 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph 6e73f6babf36c9321efc0524a82dede1 from ZooKeeper.

2017-09-04 00:51:01,000 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at flink-02 (akka.tcp://flink@flink-02:37281/user/taskmanager) as 6f8a6941cddb361de1643fa47b71a1db. Current number of registered hosts is 3. Current number of alive task slots is 24.

2017-09-04 00:51:01,000 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - TaskManager container_1504449399430_0005_01_000005 has started.

2017-09-04 00:51:02,868 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-02:45104] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-02:45104]] Caused by: [Connection refused: flink-02/10.1.0.5:45104]

... (the same WARN line repeats every ~5 seconds, through 2017-09-04 00:53:58,573) ...

2017-09-04 00:54:03,592 ERROR Remoting                                                      - Association to [akka.tcp://flink@flink-02:45104] with UID [1818461746] irrecoverably failed. Quarantining address.

java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.

           at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:336)

           at akka.actor.Actor$class.aroundReceive(Actor.scala:467)

           at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:189)

           at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

           at akka.actor.ActorCell.invoke(ActorCell.scala:487)

           at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)

           at akka.dispatch.Mailbox.run(Mailbox.scala:220)

           at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)

           at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

           at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

           at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

           at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



Attachments: jobmanager.log (104K), flink-conf.yaml (10K), killed-taskmanager.log (74K), new-taskmanager.log (52K)
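[Editor's note: the log line above, "Could not restart the job vp ... because the restart strategy prevented it", suggests that the effective restart strategy allowed no (further) restart attempts, so the job went straight from FAILING to FAILED when the TM was lost. In Flink 1.3 a fixed-delay restart strategy can be enabled cluster-wide in flink-conf.yaml; the values below are an illustrative sketch, not the poster's actual settings:]

```yaml
# Restart the job up to 3 times, waiting 10 s between attempts
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

[It can also be set per job in the application code via env.setRestartStrategy(RestartStrategies.fixedDelayRestart(...)).]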

Re: Fwd: HA: My job didn't restart even though the TaskManager restarted.

sunny yun
I am still struggling to solve this problem.
My understanding is that the job should automatically restart after the task manager restarts in YARN mode. Is that a misunderstanding?

The problem is that the job manager still tries to connect to the old task manager after the new task manager container has been created.

See the log I posted:

2017-09-04 00:51:01,000 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at flink-02 (akka.tcp://flink@flink-02:37281/user/taskmanager) as 6f8a6941cddb361de1643fa47b71a1db. Current number of registered hosts is 3. Current number of alive task slots is 24.
==> The JM registers the new TM (flink-02:37281).

2017-09-04 00:51:02,868 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-02:45104] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-02:45104]] Caused by: [Connection refused: flink-02/10.1.0.5:45104]
==> The JM still tries to connect to the old TM (flink-02:45104).
==> Port 45104 is already closed; the JM should connect to port 37281 instead.


If anyone has succeeded in the same situation (YARN + TM failure), please tell me. That would be a big help to me.
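One thing worth checking (a guess from the log line "Could not restart the job ... because the restart strategy prevented it", not a confirmed diagnosis): whether any restart strategy is configured at all, since without one the job fails permanently on the first TM loss. In Flink 1.3.x a cluster-wide fixed-delay strategy can be set in flink-conf.yaml; the values below are illustrative, not from the attached config:

```yaml
# Sketch for flink-conf.yaml (Flink 1.3.x); attempt count and delay are examples.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3   # how many times to retry the job
restart-strategy.fixed-delay.delay: 10 s   # pause between restart attempts
```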



sunny yun wrote
I have a 4-node Flink cluster running Flink 1.3.2 and Hadoop 2.6.5. All four nodes have identical configuration.

While the YARN session and the Flink job were running, I killed node #2's (flink-02) TM (YarnTaskManager) process.

After the ResourceManager created a new TM container, the job stayed in failed status.

But when I killed the JM (YarnApplicationMasterRunner) process instead, everything behaved as expected: my job restarted fine.

I'm using DFS for storageDir.

The Flink configuration file and log files are attached. If any other information is needed, please let me know.



Here is the use case: -------------------------------------------------------------------------------------------------------
1. start yarn session

flink-01 ~]$ flink/bin/yarn-session.sh -n 4 -s 8 -nm flink -d

2. submit job

flink-01 ~]$ flink/bin/flink run -d -p 3 -c a.b.c.Application abc.jar

3. find TM pid and kill it

flink-02 ~]$ jps | grep YarnTaskManager

4022 YarnTaskManager

flink-02 ~]$ kill 4022



And the JM logs: -------------------------------------------------------------------------------------------------------

2017-09-04 00:50:57,830 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-02:45104] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

2017-09-04 00:50:58,013 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Container container_1504449399430_0005_01_000004 failed. Exit status: 143

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container container_1504449399430_0005_01_000004 in state COMPLETE : exitStatus=143 diagnostics=Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

Killed by external signal



2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Total number of failed containers so far: 1

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Received new container: container_1504449399430_0005_01_000005 - Remaining pending container requests: 0

2017-09-04 00:50:58,014 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Launching TaskManager in container ContainerInLaunch @ 1504453858014: Container: [ContainerId: container_1504449399430_0005_01_000005, NodeId: flink-02:34821, NodeHttpAddress: flink-02:8042, Resource: <memory:8192, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.0.5:34821 }, ] on host flink-02

2017-09-04 00:50:58,015 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - Opening proxy : flink-02:34821

2017-09-04 00:50:58,015 INFO  org.apache.flink.yarn.YarnJobManager - Task manager akka.tcp://flink@flink-02:45104/user/taskmanager terminated.

2017-09-04 00:50:58,016 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (3/3) (33b33a9c15eb0107b320e375374ac07f) switched from RUNNING to FAILED.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)
	at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
	at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
	at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
	at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
	at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
	at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:474)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.clusterframework.ContaineredJobManager$$anonfun$handleContainerMessage$1.applyOrElse(ContaineredJobManager.scala:100)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
	at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnShutdown$1.applyOrElse(YarnJobManager.scala:103)
	at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
	at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
	at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
	at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2017-09-04 00:50:58,018 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job vp (6e73f6babf36c9321efc0524a82dede1) switched from state RUNNING to FAILING.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)
	(identical stack trace as the previous exception)

2017-09-04 00:50:58,027 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (1/3) (02232baaa7b3f88982cbf40cb5ebe489) switched from RUNNING to CANCELING.

2017-09-04 00:50:58,029 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (2/3) (952c34a277df34aa7e3216a8b96ae2d5) switched from RUNNING to CANCELING.

2017-09-04 00:50:58,029 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Requesting new TaskManager container with 8192 megabytes memory. Pending requests: 1

2017-09-04 00:50:58,030 INFO  org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager flink-02/10.1.0.5. Number of registered task managers 2. Number of available slots 16.

2017-09-04 00:50:58,065 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (1/3) (02232baaa7b3f88982cbf40cb5ebe489) switched from CANCELING to CANCELED.

2017-09-04 00:50:58,071 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: src -> Service: vps -> Sink: snk (2/3) (952c34a277df34aa7e3216a8b96ae2d5) switched from CANCELING to CANCELED.

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job vp (6e73f6babf36c9321efc0524a82dede1) if no longer possible.

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job vp (6e73f6babf36c9321efc0524a82dede1) switched from state FAILING to FAILED.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)
	(identical stack trace as the previous exception)

2017-09-04 00:50:58,072 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job vp (6e73f6babf36c9321efc0524a82dede1) because the restart strategy prevented it.

java.lang.Exception: TaskManager was lost/killed: container_1504449399430_0005_01_000004 @ flink-02 (dataPort=42381)
	(identical stack trace as the previous exception)

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Stopping checkpoint coordinator for job 6e73f6babf36c9321efc0524a82dede1

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Shutting down

2017-09-04 00:50:58,073 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Removing /flink/cluster_one/checkpoints/6e73f6babf36c9321efc0524a82dede1 from ZooKeeper

2017-09-04 00:50:58,102 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Shutting down.

2017-09-04 00:50:58,102 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter - Removing /checkpoint-counter/6e73f6babf36c9321efc0524a82dede1 from ZooKeeper

2017-09-04 00:50:58,118 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Removed job graph 6e73f6babf36c9321efc0524a82dede1 from ZooKeeper.

2017-09-04 00:51:01,000 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-02 (akka.tcp://flink@flink-02:37281/user/taskmanager) as 6f8a6941cddb361de1643fa47b71a1db. Current number of registered hosts is 3. Current number of alive task slots is 24.

2017-09-04 00:51:01,000 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - TaskManager container_1504449399430_0005_01_000005 has started.

2017-09-04 00:51:02,868 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-02:45104] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-02:45104]] Caused by: [Connection refused: flink-02/10.1.0.5:45104]

[... the same WARN message repeats roughly every 5 seconds, from 00:51:07,883 through 00:53:58,573 ...]

2017-09-04 00:54:03,592 ERROR Remoting - Association to [akka.tcp://flink@flink-02:45104] with UID [1818461746] irrecoverably failed. Quarantining address.

java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
	at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:336)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
	at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:189)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





jobmanager.log (104K) <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/attachment/15351/0/jobmanager.log>
flink-conf.yaml (10K) <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/attachment/15351/1/flink-conf.yaml>
killed-taskmanager.log (74K) <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/attachment/15351/2/killed-taskmanager.log>
new-taskmanager.log (52K) <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/attachment/15351/3/new-taskmanager.log>

Re: Fwd: HA : My job didn't restart even if task manager restarted.

Fabian Hueske-2
Hi,

sorry for the late response!
I'm not familiar with the details of the failure recovery but Till (in CC) knows the code in depth.
Maybe he can figure out what's going on.

Best, Fabian

2017-09-06 5:35 GMT+02:00 sunny yun <[hidden email]>:
I am still struggling to solve this problem.
I have no doubt that the job should automatically restart after the task manager restarts in YARN mode. Is that a misunderstanding?

The problem seems to be that *the JobManager still tries to connect to the old TaskManager even after the new TaskManager container has been created.*
When I killed the TM on node#2, a new TM container was created on node#3, but according to the log file the JM still tried to connect to the TM on node#2. (This is not the log I posted before; I found this while continuing the test. Normally the TM is recreated on the same node after being killed.)
So the new TM does not know about the job, and the JM shows the job in a failed state.

If anyone has succeeded in the same situation (YARN + TM failure), please tell me.
That would be a big help to me.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Fwd: HA : My job didn't restart even if task manager restarted.

sunny yun
Hi Fabian,

Thank you for your reply.

I found a solution: after setting "enableCheckpointing", it works as intended.

Actually I don't want to use checkpoints, at least for now, so I am setting the interval very large. Do you have any other suggestions?

Best regards, Sunny.
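For reference, enabling checkpointing is a one-line call on the execution environment in the job code. This is only a sketch of that call (the interval value and the surrounding class are illustrative, not from the attached job):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Trigger a checkpoint every 10 minutes; a large interval keeps
        // checkpointing overhead low while still enabling recovery.
        env.enableCheckpointing(600_000L);
        // ... define sources, operators, and sinks, then call env.execute(...)
    }
}
```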





Re: Fwd: HA : My job didn't restart even if task manager restarted.

sunny yun
In reply to this post by Fabian Hueske-2
I misunderstood: what fixed it was setting a "RestartStrategy", not "enableCheckpointing".
To recap, a restart strategy is required for HA to take effect.

regards, Sunny.
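For reference, setting a restart strategy explicitly is also a single call on the execution environment. A sketch (attempt count and delay are illustrative):

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Restart the job up to 3 times, waiting 10 seconds between attempts.
        // With this set, the job can be restarted after a TM failure even
        // without checkpointing enabled.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10_000L));
        // ... define the job, then call env.execute(...)
    }
}
```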




Re: Fwd: HA : My job didn't restart even if task manager restarted.

sunny yun
In reply to this post by Fabian Hueske-2
Hi Till,
Thank you for the reply.

I have another question.
According to the JM log, when a TM terminates, the job always changes to the FAILING state.
I want to avoid job failures as much as possible, because recovery takes time and the streams are stopped during recovery.
"fixed-delay.delay" is set to 0 seconds, but that is not enough.
- Question 1: Is there a way to restart only the sub-tasks of the failed node?
- Question 2: The Flink Kafka consumer is never re-partitioned, because partitions are assigned rather than subscribed. Is that correct?
The Flink version is 1.2.1; source and sink are Kafka 0.10.2.0.

I really appreciate your help.

regards,
Sunny


2017-09-14 17:35 GMT+09:00 Till Rohrmann <[hidden email]>:
Yes, you are right, Sunny. In order to restart a job you have to set a restart strategy. The reason it also worked when you enabled checkpointing is that the system sets a default restart strategy if none was configured.

Cheers,
Till
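As a supplement, a cluster-wide default restart strategy can also be set in flink-conf.yaml instead of in the job code. A sketch of such a fragment (values illustrative):

```yaml
# Default restart strategy for jobs that don't set one programmatically:
# restart up to 3 times, waiting 10 s between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

A per-job strategy set via setRestartStrategy() takes precedence over this cluster default.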
