JobManager shows TaskManager was lost/killed while TaskManager Process is still running and the network is OK.


Renkai

I am using the newest snapshot of Flink, and all jobs failed after a TaskManager was reported as lost/killed. Below is a sample of the JobManager and TaskManager logs.

 

//job manager

java.lang.Exception: TaskManager was lost/killed: ResourceID{resourceId='8f4b98897b1cbdbb576cbf298ac1339f'} @ 10.17.123.56 (dataPort=62636)

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:214)

        at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1160)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1063)

        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:119)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2016-11-25 07:19:58,136 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Could not restart the job shop-monitor (cd3b18a4854c3f720cb581b1c84830c4).

 

//task manager

2016-11-25 07:08:31,312 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968

2016-11-25 07:08:31,319 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 147624 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:08:31,321 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.11/10.17.34.11:2181

2016-11-25 07:08:31,322 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session

2016-11-25 07:08:31,325 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4c7, negotiated timeout = 40000

2016-11-25 07:09:45,169 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x456b80d2f6ce4c4 closed

2016-11-25 07:09:45,170 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:09:45,170 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:09:45,169 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:09:45,170 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6

2016-11-25 07:09:45,170 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2

2016-11-25 07:09:45,171 INFO  com.mogujie.corgi.common.keeper.KeeperProxy                   - unable to refresh keeper status, cause: java.util.concurrent.TimeoutException, master: 10.15.2.123:8888

2016-11-25 07:09:45,171 WARN  com.mogujie.corgi.net.handler.DispatchHandler                 - no future for response, route: 1, from: /10.15.2.123:8888, packetId: 17073

2016-11-25 07:09:45,171 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 147297 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:09:45,174 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 147300 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:09:45,174 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Background operation retry gave up

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:708)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:826)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)

2016-11-25 07:09:45,175 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.36.74/10.17.36.74:2181

2016-11-25 07:09:45,176 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.11/10.17.34.11:2181

2016-11-25 07:09:45,176 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Background retry gave up

org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:809)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)

2016-11-25 07:09:45,177 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.36.74/10.17.36.74:2181, initiating session

2016-11-25 07:09:45,177 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session

2016-11-25 07:09:45,179 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server 10.17.36.74/10.17.36.74:2181, sessionid = 0x556b80d3a88e2f1, negotiated timeout = 40000

2016-11-25 07:09:45,179 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4cc, negotiated timeout = 40000

2016-11-25 07:09:45,180 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:09:45,180 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x556b80d3a88e2f1 closed

2016-11-25 07:09:45,181 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6

2016-11-25 07:09:45,181 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x456b80d2f6ce4cc closed

2016-11-25 07:09:45,181 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:09:45,181 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2

2016-11-25 07:09:45,182 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager  - State change: LOST

2016-11-25 07:10:59,160 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x456b80d2f6ce4c7 closed

2016-11-25 07:10:59,163 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.11/10.17.34.11:2181

2016-11-25 07:10:59,161 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.11/10.17.34.11:2181

2016-11-25 07:10:59,161 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Background operation retry gave up

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:708)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:826)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)

2016-11-25 07:10:59,161 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Background operation retry gave up

org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:708)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:826)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)

2016-11-25 07:10:59,160 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:10:59,167 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Background retry gave up

org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:809)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)

2016-11-25 07:10:59,167 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session

2016-11-25 07:10:59,167 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Background retry gave up

org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:809)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)

        at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at java.lang.Thread.run(Thread.java:745)

2016-11-25 07:10:59,168 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 73997 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:10:59,167 INFO  org.apache.zookeeper.ClientCnxn                               - Client session timed out, have not heard from server in 73985ms for sessionid 0x0, closing socket connection and attempting reconnect

2016-11-25 07:10:59,163 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968

2016-11-25 07:10:59,168 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 73994 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:10:59,169 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4ce, negotiated timeout = 40000

2016-11-25 07:12:13,370 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 147852 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:12:13,370 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.22/10.17.34.22:2181

2016-11-25 07:12:13,370 WARN  com.mogujie.corgi.net.handler.DispatchHandler                 - no future for response, route: 1, from: /10.11.13.22:9003, packetId: 17076

2016-11-25 07:12:13,373 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:12:13,373 INFO  com.mogujie.corgi.net.channel.AbstractChannelHandler          - user idleTriggered event triggered, channel: [id: 0x60449ebe, /10.17.123.56:14660 => corgi.keeper.service.mogujie.org/10.15.2.123:8888]

2016-11-25 07:12:13,373 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:12:13,373 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2

2016-11-25 07:12:13,375 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 74205 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:12:13,377 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.22/10.17.34.22:2181

2016-11-25 07:12:13,378 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.34.22/10.17.34.22:2181, initiating session

2016-11-25 07:13:27,110 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x456b80d2f6ce4ce closed

2016-11-25 07:13:27,110 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:13:27,111 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6

2016-11-25 07:13:27,111 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:13:27,111 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968

2016-11-25 07:13:27,111 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:13:27,112 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 147944 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:13:27,112 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 73742 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:13:27,114 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.22/10.17.34.22:2181

2016-11-25 07:13:27,114 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.11/10.17.34.11:2181

2016-11-25 07:13:27,115 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.34.22/10.17.34.22:2181, initiating session

2016-11-25 07:13:27,117 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session

2016-11-25 07:13:27,118 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server 10.17.34.22/10.17.34.22:2181, sessionid = 0x356b80d2eebe879, negotiated timeout = 40000

2016-11-25 07:13:27,118 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4da, negotiated timeout = 40000

2016-11-25 07:14:33,247 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x356b80d2eebe879 closed

2016-11-25 07:14:33,248 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:14:33,248 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:14:33,247 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:14:33,248 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2

2016-11-25 07:14:33,248 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6

2016-11-25 07:14:33,249 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 66137 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:14:33,249 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 139874 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:14:33,253 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.22/10.17.34.22:2181

2016-11-25 07:14:33,253 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.34.22/10.17.34.22:2181, initiating session

2016-11-25 07:14:33,255 INFO  org.apache.zookeeper.ClientCnxn                               - Session establishment complete on server 10.17.34.22/10.17.34.22:2181, sessionid = 0x356b80d2eebe87d, negotiated timeout = 40000

2016-11-25 07:14:33,256 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:14:33,256 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x356b80d2eebe87d closed

2016-11-25 07:14:33,258 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2

2016-11-25 07:15:38,952 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 65703 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:15:38,953 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.36.74/10.17.36.74:2181

2016-11-25 07:15:38,953 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.34.22/10.17.34.22:2181

2016-11-25 07:15:38,955 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x456b80d2f6ce4da closed

2016-11-25 07:15:38,955 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:15:38,955 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968

2016-11-25 07:16:45,848 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 131846 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:16:45,850 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.36.74/10.17.36.74:2181

2016-11-25 07:16:45,850 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:16:45,850 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:16:45,850 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:16:45,850 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:16:45,850 WARN  com.mogujie.corgi.net.handler.DispatchHandler                 - no future for response, route: 1, from: /10.17.36.202:9003, packetId: 17095

2016-11-25 07:16:45,852 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2

2016-11-25 07:16:45,851 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6

2016-11-25 07:16:45,853 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 66900 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:16:45,853 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState    - Connection attempt unsuccessful after 132604 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.

2016-11-25 07:16:45,855 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.36.74/10.17.36.74:2181

2016-11-25 07:16:45,856 INFO  org.apache.zookeeper.ClientCnxn                               - Socket connection established to 10.17.36.74/10.17.36.74:2181, initiating session

2016-11-25 07:19:05,172 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:21:32,016 INFO  org.apache.zookeeper.ClientCnxn                               - Opening socket connection to server 10.17.36.74/10.17.36.74:2181

2016-11-25 07:21:32,016 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:21:32,016 INFO  org.apache.zookeeper.ZooKeeper                                - Session: 0x0 closed

2016-11-25 07:19:05,172 INFO  org.apache.zookeeper.ClientCnxn                               - EventThread shut down

2016-11-25 07:25:16,838 INFO  com.mogujie.corgi.net.channel.AbstractChannelHandler          - user idleTriggered event triggered, channel: [id: 0x60449ebe, /10.17.123.56:14660 => corgi.keeper.service.mogujie.org/10.15.2.123:8888]

2016-11-25 07:24:02,740 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968

2016-11-25 07:27:43,909 INFO  org.apache.zookeeper.ZooKeeper                                - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6

2016-11-25 07:27:43,909 INFO  com.mogujie.corgi.net.channel.AbstractChannelHandler          - user idleTriggered event triggered, channel: [id: 0xcf39e7ef, /10.17.123.56:31625 => /10.11.13.22:9003]

2016-11-25 07:27:43,910 INFO  com.mogujie.corgi.net.channel.AbstractChannelHandler          - user idleTriggered event triggered, channel: [id: 0x12f19cbd, /10.17.123.56:18394 => /10.11.13.14:9003]

 

I suspect a bug is causing this error.


Re: JobManager shows TaskManager was lost/killed while TaskManager Process is still running and the network is OK.

Renkai
Some additional logs I found in the JobManager:

2016-11-25 07:19:57,958 WARN  akka.remote.RemoteWatcher                                     - Detected unreachable: [akka.tcp://flink@10.17.123.56:59247]
2016-11-25 07:19:57,962 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Task manager akka.tcp://flink@10.17.123.56:59247/user/taskmanager terminated.

Re: JobManager shows TaskManager was lost/killed while TaskManager Process is still running and the network is OK.

Till Rohrmann
Hi Renkai,

It seems to me that the TM somehow lost its network connection. As a result, the JM's heartbeat goes unanswered and the JM marks the TM as terminated. This would also explain why the TM can no longer talk to ZooKeeper.

Is this problem reproducible? If so, could you share the full logs with us?

Cheers,
Till

On Fri, Nov 25, 2016 at 5:12 AM, Renkai <[hidden email]> wrote:
> Here are some additional logs I found in the JobManager:
>
> 2016-11-25 07:19:57,958 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@10.17.123.56:59247]
> 2016-11-25 07:19:57,962 INFO  org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@10.17.123.56:59247/user/taskmanager terminated.




Re: JobManager shows TaskManager was lost/killed while TaskManger Process is still running and the network is OK.

Renkai
The ZooKeeper-related logs are written by user code. I finally found the reason the TaskManager was lost: I had given the TaskManager a large amount of memory, and the JobManager decided the TaskManager was down while the TaskManager was in a full GC. Thanks for your help.
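
A minimal sketch of how this diagnosis could be checked and tolerated in flink-conf.yaml. It assumes a Java 8 HotSpot JVM and a Flink version in which the JobManager watches TaskManagers through Akka's death watch, as the akka.remote.RemoteWatcher log line above suggests. The log path and the 120 s value are illustrative assumptions, not settings taken from this thread.

```yaml
# flink-conf.yaml (sketch; values are assumptions, not from this thread)

# Log GC activity so that long full-GC pauses can be matched against the
# time at which the JobManager reported the TaskManager as unreachable.
# The flags are standard Java 8 HotSpot options; the path is hypothetical.
env.java.opts: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/flink-gc.log

# Allow longer heartbeat pauses before Akka's death watch declares the
# TaskManager dead. Pick a value above the longest pause seen in the GC log.
akka.watch.heartbeat.pause: 120 s
```

Raising the pause only hides the symptom. If the GC log confirms multi-second full GCs, a smaller heap or a different collector may be the better fix.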

Re: JobManager shows TaskManager was lost/killed while TaskManger Process is still running and the network is OK.

AndreaKinn
Hi, sorry for reviving this old conversation.
I have exactly the same problem. Can you provide more details about your
solution?
Did you switch to another garbage collector, such as G1? How can I set it?

I've seen in the configuration guide that I have to set the option env.java.opts,
but I don't know which value to use to enable G1.


Renkai wrote
> The ZooKeeper-related logs are written by user code. I finally found the
> reason the TaskManager was lost: I had given the TaskManager a large
> amount of memory, and the JobManager decided the TaskManager was down
> while the TaskManager was in a full GC. Thanks for your help.






Re: JobManager shows TaskManager was lost/killed while TaskManger Process is still running and the network is OK.

Rahul Raj
Hi all,

I am facing the same issue. My code fails after running for 15 hours, throwing the same "TaskManager was lost/killed" exception. Could someone please describe a possible solution in detail?

Rahul Raj

On 15 September 2017 at 23:06, AndreaKinn <[hidden email]> wrote:
> Hi, sorry for reviving this old conversation.
> I have exactly the same problem. Can you provide more details about your
> solution?
> Did you switch to another garbage collector, such as G1? How can I set it?
>
> I've seen in the configuration guide that I have to set the option env.java.opts,
> but I don't know which value to use to enable G1.
>
> Renkai wrote
> > The ZooKeeper-related logs are written by user code. I finally found the
> > reason the TaskManager was lost: I had given the TaskManager a large
> > amount of memory, and the JobManager decided the TaskManager was down
> > while the TaskManager was in a full GC. Thanks for your help.


Re: JobManager shows TaskManager was lost/killed while TaskManger Process is still running and the network is OK.

Nico Kruber
In reply to this post by AndreaKinn
From what I read in [1], you simply add JVM options to env.java.opts as you
would when starting a Java program yourself, so setting "-XX:+UseG1GC" should
enable G1.

Nico

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#common-options
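
As a minimal sketch of that suggestion, the entry in flink-conf.yaml could look like the following. The file name and key are the ones from the Flink configuration page referenced in [1]. Whether G1 actually shortens the pauses depends on the heap size and workload, which is an assumption to verify rather than a guaranteed fix.

```yaml
# flink-conf.yaml (sketch)
# JVM options passed to the Java processes started by Flink's scripts.
# -XX:+UseG1GC is the standard HotSpot flag that selects the G1 collector.
env.java.opts: -XX:+UseG1GC
```

If env.java.opts is already set, append the flag to the existing value, separated by a space, and restart the cluster so the new JVM options take effect.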

On Friday, 15 September 2017 19:36:02 CET AndreaKinn wrote:

> Hi, sorry for reviving this old conversation.
> I have exactly the same problem. Can you provide more details about your
> solution?
> Did you switch to another garbage collector, such as G1? How can I set it?
>
> I've seen in the configuration guide that I have to set the option env.java.opts,
> but I don't know which value to use to enable G1.
>
>
> Renkai wrote
>
> > The ZooKeeper-related logs are written by user code. I finally found the
> > reason the TaskManager was lost: I had given the TaskManager a large
> > amount of memory, and the JobManager decided the TaskManager was down
> > while the TaskManager was in a full GC. Thanks for your help.

