I am using the newest snapshot of Flink, and all jobs failed after a TaskManager was lost/killed. Below is a sample of the JobManager and TaskManager logs.

//job manager
java.lang.Exception: TaskManager was lost/killed: ResourceID{resourceId='8f4b98897b1cbdbb576cbf298ac1339f'} @ 10.17.123.56 (dataPort=62636)
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:214)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1160)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1063)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:119)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-11-25 07:19:58,136 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Could not restart the job shop-monitor (cd3b18a4854c3f720cb581b1c84830c4).

//task manager
2016-11-25 07:08:31,312 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968
2016-11-25 07:08:31,319 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 147624 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
2016-11-25 07:08:31,321 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.11/10.17.34.11:2181 2016-11-25 07:08:31,322 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session 2016-11-25 07:08:31,325 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4c7, negotiated timeout = 40000 2016-11-25 07:09:45,169 INFO org.apache.zookeeper.ZooKeeper - Session: 0x456b80d2f6ce4c4 closed 2016-11-25 07:09:45,170 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:09:45,170 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:09:45,169 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:09:45,170 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6 2016-11-25 07:09:45,170 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2 2016-11-25 07:09:45,171 INFO com.mogujie.corgi.common.keeper.KeeperProxy - unable to refresh keeper status, cause: java.util.concurrent.TimeoutException, master: 10.15.2.123:8888 2016-11-25 07:09:45,171 WARN com.mogujie.corgi.net.handler.DispatchHandler - no future for response, route: 1, from: /10.15.2.123:8888, packetId: 17073 2016-11-25 07:09:45,171 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 147297 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 
2016-11-25 07:09:45,174 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 147300 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:09:45,174 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Background operation retry gave up org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:708) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:826) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-11-25 07:09:45,175 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.36.74/10.17.36.74:2181 2016-11-25 07:09:45,176 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.11/10.17.34.11:2181 2016-11-25 07:09:45,176 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Background retry gave up org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:809) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-11-25 07:09:45,177 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.36.74/10.17.36.74:2181, initiating session 2016-11-25 07:09:45,177 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session 2016-11-25 07:09:45,179 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.36.74/10.17.36.74:2181, sessionid = 0x556b80d3a88e2f1, negotiated timeout = 40000 2016-11-25 07:09:45,179 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4cc, negotiated timeout = 40000 2016-11-25 07:09:45,180 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:09:45,180 INFO org.apache.zookeeper.ZooKeeper - Session: 0x556b80d3a88e2f1 closed 2016-11-25 07:09:45,181 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6 2016-11-25 07:09:45,181 INFO 
org.apache.zookeeper.ZooKeeper - Session: 0x456b80d2f6ce4cc closed 2016-11-25 07:09:45,181 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:09:45,181 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2 2016-11-25 07:09:45,182 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: LOST 2016-11-25 07:10:59,160 INFO org.apache.zookeeper.ZooKeeper - Session: 0x456b80d2f6ce4c7 closed 2016-11-25 07:10:59,163 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.11/10.17.34.11:2181 2016-11-25 07:10:59,161 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.11/10.17.34.11:2181 2016-11-25 07:10:59,161 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Background operation retry gave up org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:708) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:826) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-11-25 07:10:59,161 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Background operation retry gave up org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:708) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:826) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-11-25 07:10:59,160 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:10:59,167 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Background retry gave up org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:809) at 
org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-11-25 07:10:59,167 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session 2016-11-25 07:10:59,167 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl - Background retry gave up org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:809) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-11-25 07:10:59,168 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 73997 (greater than max timeout of 60000). 
Resetting connection and trying again with a new connection. 2016-11-25 07:10:59,167 INFO org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 73985ms for sessionid 0x0, closing socket connection and attempting reconnect 2016-11-25 07:10:59,163 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968 2016-11-25 07:10:59,168 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 73994 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:10:59,169 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4ce, negotiated timeout = 40000 2016-11-25 07:12:13,370 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 147852 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 
2016-11-25 07:12:13,370 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.22/10.17.34.22:2181 2016-11-25 07:12:13,370 WARN com.mogujie.corgi.net.handler.DispatchHandler - no future for response, route: 1, from: /10.11.13.22:9003, packetId: 17076 2016-11-25 07:12:13,373 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:12:13,373 INFO com.mogujie.corgi.net.channel.AbstractChannelHandler - user idleTriggered event triggered, channel: [id: 0x60449ebe, /10.17.123.56:14660 => corgi.keeper.service.mogujie.org/10.15.2.123:8888] 2016-11-25 07:12:13,373 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:12:13,373 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2 2016-11-25 07:12:13,375 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 74205 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 
2016-11-25 07:12:13,377 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.22/10.17.34.22:2181 2016-11-25 07:12:13,378 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.34.22/10.17.34.22:2181, initiating session 2016-11-25 07:13:27,110 INFO org.apache.zookeeper.ZooKeeper - Session: 0x456b80d2f6ce4ce closed 2016-11-25 07:13:27,110 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:13:27,111 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6 2016-11-25 07:13:27,111 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:13:27,111 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968 2016-11-25 07:13:27,111 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:13:27,112 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 147944 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:13:27,112 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 73742 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 
2016-11-25 07:13:27,114 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.22/10.17.34.22:2181 2016-11-25 07:13:27,114 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.11/10.17.34.11:2181 2016-11-25 07:13:27,115 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.34.22/10.17.34.22:2181, initiating session 2016-11-25 07:13:27,117 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.34.11/10.17.34.11:2181, initiating session 2016-11-25 07:13:27,118 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.34.22/10.17.34.22:2181, sessionid = 0x356b80d2eebe879, negotiated timeout = 40000 2016-11-25 07:13:27,118 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4da, negotiated timeout = 40000 2016-11-25 07:14:33,247 INFO org.apache.zookeeper.ZooKeeper - Session: 0x356b80d2eebe879 closed 2016-11-25 07:14:33,248 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:14:33,248 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:14:33,247 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:14:33,248 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2 2016-11-25 07:14:33,248 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6 2016-11-25 07:14:33,249 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection 
attempt unsuccessful after 66137 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:14:33,249 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 139874 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:14:33,253 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.22/10.17.34.22:2181 2016-11-25 07:14:33,253 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.34.22/10.17.34.22:2181, initiating session 2016-11-25 07:14:33,255 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.34.22/10.17.34.22:2181, sessionid = 0x356b80d2eebe87d, negotiated timeout = 40000 2016-11-25 07:14:33,256 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:14:33,256 INFO org.apache.zookeeper.ZooKeeper - Session: 0x356b80d2eebe87d closed 2016-11-25 07:14:33,258 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2 2016-11-25 07:15:38,952 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 65703 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 
2016-11-25 07:15:38,953 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.36.74/10.17.36.74:2181 2016-11-25 07:15:38,953 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.34.22/10.17.34.22:2181 2016-11-25 07:15:38,955 INFO org.apache.zookeeper.ZooKeeper - Session: 0x456b80d2f6ce4da closed 2016-11-25 07:15:38,955 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:15:38,955 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968 2016-11-25 07:16:45,848 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 131846 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:16:45,850 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.36.74/10.17.36.74:2181 2016-11-25 07:16:45,850 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:16:45,850 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:16:45,850 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:16:45,850 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:16:45,850 WARN com.mogujie.corgi.net.handler.DispatchHandler - no future for response, route: 1, from: /10.17.36.202:9003, packetId: 17095 2016-11-25 07:16:45,852 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@50d1bcf2 2016-11-25 07:16:45,851 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, 
connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6 2016-11-25 07:16:45,853 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 66900 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:16:45,853 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Connection attempt unsuccessful after 132604 (greater than max timeout of 60000). Resetting connection and trying again with a new connection. 2016-11-25 07:16:45,855 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.36.74/10.17.36.74:2181 2016-11-25 07:16:45,856 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to 10.17.36.74/10.17.36.74:2181, initiating session 2016-11-25 07:19:05,172 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:21:32,016 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.17.36.74/10.17.36.74:2181 2016-11-25 07:21:32,016 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:21:32,016 INFO org.apache.zookeeper.ZooKeeper - Session: 0x0 closed 2016-11-25 07:19:05,172 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down 2016-11-25 07:25:16,838 INFO com.mogujie.corgi.net.channel.AbstractChannelHandler - user idleTriggered event triggered, channel: [id: 0x60449ebe, /10.17.123.56:14660 => corgi.keeper.service.mogujie.org/10.15.2.123:8888] 2016-11-25 07:24:02,740 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@77f9e968 2016-11-25 07:27:43,909 INFO 
org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=kafka.zk1.service.mogujie.org:2181,kafka.zk2.service.mogujie.org:2181,kafka.zk3.service.mogujie.org:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@3a5d2ba6
2016-11-25 07:27:43,909 INFO com.mogujie.corgi.net.channel.AbstractChannelHandler - user idleTriggered event triggered, channel: [id: 0xcf39e7ef, /10.17.123.56:31625 => /10.11.13.22:9003]
2016-11-25 07:27:43,910 INFO com.mogujie.corgi.net.channel.AbstractChannelHandler - user idleTriggered event triggered, channel: [id: 0x12f19cbd, /10.17.123.56:18394 => /10.11.13.14:9003]

I suspect that a bug is causing this error.
Some additional logs that I found in the JobManager:
2016-11-25 07:19:57,958 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@10.17.123.56:59247]
2016-11-25 07:19:57,962 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@10.17.123.56:59247/user/taskmanager terminated.
Hi Renkai,

it seems to me as if the TM lost its network connection somehow. Therefore, the JM's heartbeat won't get answered and it marks the TM as terminated. This would also explain why the TM can no longer talk to ZooKeeper.

Is this problem reproducible? If so, could you share the full logs with us?

Cheers,
Till

On Fri, Nov 25, 2016 at 5:12 AM, Renkai <[hidden email]> wrote:
> Some additional logs that I found in the JobManager:
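To make the reasoning in Till's reply concrete, here is a toy model of heartbeat-based failure detection. It is only a sketch under simplifying assumptions: it uses a fixed acceptable pause, whereas Akka's real failure detector is more elaborate, and the function name and parameters are hypothetical rather than part of Flink or Akka.

```python
def time_declared_lost(heartbeat_times, acceptable_pause, now):
    """Return the time at which a peer would be declared lost, or None.

    heartbeat_times: times (in seconds) at which heartbeats were answered.
    acceptable_pause: longest silence (in seconds) the monitor tolerates.
    now: the current time of the monitor (in seconds).
    """
    last = None
    for t in sorted(heartbeat_times):
        # A silence longer than the acceptable pause means the peer was
        # already declared lost before this heartbeat arrived.
        if last is not None and t - last > acceptable_pause:
            return last + acceptable_pause
        last = t
    # Also check the silence between the final heartbeat and "now".
    if last is not None and now - last > acceptable_pause:
        return last + acceptable_pause
    return None


if __name__ == "__main__":
    # Heartbeats every 10 s, then 80 s of silence (network loss or a long
    # stop-the-world pause look the same to the monitor).
    beats = [0, 10, 20, 30, 40, 50, 130, 140]
    print(time_declared_lost(beats, acceptable_pause=60, now=150))
```

The point of the sketch is that the monitor cannot tell why the heartbeats stopped. Anything that silences the TM for longer than the tolerated pause leads to the same "terminated" verdict.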
The ZooKeeper-related logs are written by user code. I finally found the reason why the TaskManager was lost: I had given the TaskManager a large amount of memory, and the JobManager considered the TaskManager to be down while the TaskManager was in a Full GC. Thanks for your help.
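The TaskManager log in the first post is consistent with this explanation: its entries arrive in bursts roughly 74 seconds apart (07:08:31, 07:09:45, 07:10:59, ...), as one would expect if the whole JVM were paused in between. The following is a minimal sketch for spotting such silences in a log file. It assumes the timestamp format shown in this thread; the function name and the 30-second threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime

# Matches timestamps such as "2016-11-25 07:08:31,325".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def find_gaps(log_text, min_gap_seconds=30.0):
    """Return (previous, current, seconds) for each long silence in the log."""
    stamps = sorted(
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f")
        for s in TIMESTAMP.findall(log_text)
    )  # sorted because entries from different threads can be out of order
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta >= min_gap_seconds:
            gaps.append((prev, cur, delta))
    return gaps


# Three consecutive entries copied from the TaskManager log above.
SAMPLE = """\
2016-11-25 07:08:31,325 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.17.34.11/10.17.34.11:2181, sessionid = 0x456b80d2f6ce4c7, negotiated timeout = 40000
2016-11-25 07:09:45,169 INFO org.apache.zookeeper.ZooKeeper - Session: 0x456b80d2f6ce4c4 closed
2016-11-25 07:09:45,170 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down
"""

if __name__ == "__main__":
    for prev, cur, seconds in find_gaps(SAMPLE):
        print(f"silence of {seconds:.1f} s between {prev} and {cur}")
```

A silence in the application log does not prove a GC pause on its own. Comparing the gaps against the JVM's GC log, if one is enabled, would be the way to confirm it.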
Hi, sorry for reviving this old conversation.
I have exactly the same problem. Can you provide more details about your solution? Have you used another garbage collector, such as G1? How can I set it?

I've seen in the configuration guide that I have to set the option env.java.opts, but I don't know which value to use to enable G1.

Renkai wrote
> The ZooKeeper-related logs are written by user code. I finally found the
> reason why the TaskManager was lost: I had given the TaskManager a large
> amount of memory, and the JobManager considered the TaskManager to be down
> while the TaskManager was in a Full GC. Thanks for your help.
Hi all,

I am facing the same issue as well. My code fails after running for 15 hours, throwing the same "TaskManager was lost/killed" exception. Could you please describe the possible solution in detail?

Rahul Raj

On 15 September 2017 at 23:06, AndreaKinn <[hidden email]> wrote:
> Hi, sorry for reviving this old conversation.
In reply to this post by AndreaKinn
From what I read in [1], you simply add JVM options to env.java.opts as you would when you start a Java program yourself, so setting "-XX:+UseG1GC" should enable G1.

Nico

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#common-options

On Friday, 15 September 2017 19:36:02 CET AndreaKinn wrote:
> Hi, sorry for reviving this old conversation.
> I have exactly the same problem. Can you provide more details about your
> solution?
> Have you used another garbage collector, such as G1? How can I set it?
>
> I've seen in the configuration guide that I have to set the option
> env.java.opts, but I don't know which value to use to enable G1.
>
> Renkai wrote
>
> > The ZooKeeper-related logs are written by user code. I finally found the
> > reason why the TaskManager was lost: I had given the TaskManager a large
> > amount of memory, and the JobManager considered the TaskManager to be down
> > while the TaskManager was in a Full GC. Thanks for your help.
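Nico's suggestion would look roughly like the fragment below in flink-conf.yaml. This is a sketch rather than a tested configuration: the option name and the flag come from the messages above, and whether G1 actually shortens the pauses for a given heap size and workload is something to verify on your own cluster.

```yaml
# flink-conf.yaml
# JVM options passed to the Flink processes, written as they would appear
# on a java command line. -XX:+UseG1GC selects the G1 collector.
env.java.opts: -XX:+UseG1GC
```

The Flink processes need to be restarted for the new JVM options to take effect.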