Hello,

I am facing an error for which I cannot figure out the cause. Any idea what could cause such an error?

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Dr. Radu Tudoran
Research Engineer - Big Data Expert
HUAWEI TECHNOLOGIES Duesseldorf GmbH, European Research Center
Hey,

Meanwhile the TaskManager is also trying to connect to the JobManager and fails multiple times. This happens on multiple TaskManagers, seemingly at random. So the TM stays alive, but the connection is lost. Maybe these are related. We are currently trying to debug this problem.

Gyula

Till Rohrmann <[hidden email]> wrote (on Thu, Feb 4, 2016, 15:55):
Hi,

Well… yesterday when I looked into it there was no additional info beyond what I had sent. Today I reproduced the problem, and I could see the following in the log file:

akka.actor.ActorInitializationException: exception during creation
    at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
    at akka.actor.ActorCell.create(ActorCell.scala:596)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at akka.util.Reflect$.instantiate(Reflect.scala:66)
    at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)
    at akka.actor.Props.newActor(Props.scala:252)
    at akka.actor.ActorCell.newActor(ActorCell.scala:552)
    at akka.actor.ActorCell.create(ActorCell.scala:578)
    ... 10 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task resource cleanup
java.lang.OutOfMemoryError: GC overhead limit exceeded
11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task - FATAL - exception in task exception handler
java.lang.OutOfMemoryError: GC overhead limit exceeded
…
- Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded

It looks like the input arrives faster than the garbage collector can reclaim memory.

Dr. Radu Tudoran
HUAWEI TECHNOLOGIES Duesseldorf GmbH, European Research Center
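If the input really does outpace processing, two common mitigations are giving the TaskManager more heap (the taskmanager.heap.mb setting in flink-conf.yaml, for Flink versions of this era) and slowing the source down. Below is a minimal sketch of the second idea; it is not from this thread, and the names ThrottledSource, maxRecordsPerSecond, and readNextRecord are hypothetical:

    import org.apache.flink.streaming.api.functions.source.SourceFunction
    import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext

    // Emits at most roughly maxRecordsPerSecond records to relieve memory
    // pressure downstream. Crude fixed-rate throttling, for illustration only.
    class ThrottledSource(maxRecordsPerSecond: Long) extends SourceFunction[String] {
      @volatile private var running = true

      override def run(ctx: SourceContext[String]): Unit = {
        val pauseMillis = 1000L / math.max(1L, maxRecordsPerSecond)
        while (running) {
          ctx.collect(readNextRecord()) // hypothetical ingestion call
          Thread.sleep(pauseMillis)     // pause between emissions
        }
      }

      override def cancel(): Unit = {
        running = false
      }

      private def readNextRecord(): String = "" // placeholder for the real input
    }

This only papers over the symptom, of course; if the job itself retains too much state, the heap will eventually fill regardless of the input rate.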
@Gyula Do you see log messages about quarantined actor systems? There may be an issue with the Akka death watch: once the connection is lost, it cannot be re-established unless the TaskManager is restarted.

On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <[hidden email]> wrote:
Yes, exactly, it says it is quarantined.

Gyula

On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <[hidden email]> wrote:
Okay, here are the docs for the Akka version we are using: http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model

It says that after a remote death watch triggers, the actor system must be restarted before it can connect again. We probably need to do one of the following:

- Either restart the TaskManager actor system when it detects that it is quarantined (or when it senses the JobManager as failed)
- Or switch from Akka death watch to a manual heartbeat mechanism

Would be good to also have Till's input on this... What do you think?

Stephan

On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <[hidden email]> wrote:
We should probably add a "restart on quarantined" strategy to the TaskManager anyway. We can detect it as follows: http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state

On Thu, Feb 4, 2016 at 5:18 PM, Stephan Ewen <[hidden email]> wrote:
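For reference, a rough sketch of the detection idea behind that link, against Akka 2.3: subscribe to the remoting events on the event stream and watch for association errors whose cause chain mentions quarantining. Matching on the exception message is a heuristic and an assumption here, not a stable API contract, and QuarantineListener is a made-up name:

    import akka.actor.{Actor, ActorSystem, Props}
    import akka.remote.AssociationErrorEvent

    // Watches for association errors that look like a quarantine and reacts by
    // shutting the actor system down so a supervising process can restart it.
    class QuarantineListener extends Actor {
      def receive: Receive = {
        case e: AssociationErrorEvent if mentionsQuarantine(e) =>
          // We appear to be quarantined by e.remoteAddress; restart from scratch.
          context.system.shutdown()
        case _ => // ignore other remoting lifecycle events
      }

      private def mentionsQuarantine(e: AssociationErrorEvent): Boolean = {
        var t: Throwable = e.cause
        while (t != null) {
          if (t.getMessage != null && t.getMessage.contains("quarantined")) return true
          t = t.getCause
        }
        false
      }
    }

    object QuarantineDetection {
      def install(system: ActorSystem): Unit = {
        val listener = system.actorOf(Props[QuarantineListener], "quarantineListener")
        system.eventStream.subscribe(listener, classOf[AssociationErrorEvent])
      }
    }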
I am not an expert on this, but I guess either could work. What is also interesting is that sometimes this doesn't happen for a day and sometimes it happens twice an hour, so it is probably some network issue as well.

Gyula

Stephan Ewen <[hidden email]> wrote (on Thu, Feb 4, 2016, 16:32):
I agree with Stephan. Once a TaskManager is quarantined, the actor system has to be restarted before it can reconnect.

Adding our own heartbeat mechanism should be fairly easy. However, we would lose the more elaborate phi accrual failure detector, which can react to changing network conditions. But I'm not sure how much difference this makes in reality.

Listening to the quarantine events and restarting the TaskManager, as Stephan suggested, also seems reasonable. Maybe we first do the latter option and then re-add our own heartbeats again.

Cheers,
Till

On Fri, Feb 5, 2016 at 9:27 AM, Gyula Fóra <[hidden email]> wrote:
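For what it's worth, a minimal sketch of what a manual heartbeat could look like, using a fixed miss threshold rather than a phi accrual detector. This is illustrative only, not the eventual Flink implementation; Ping, Pong, HeartbeatMonitor, and maxMissed are made-up names:

    import scala.concurrent.duration._
    import akka.actor.{Actor, ActorRef, Cancellable}

    case object Ping
    case object Pong
    case object CheckHeartbeat

    // Pings the peer on a fixed interval; a Pong resets the miss counter.
    // More than maxMissed consecutive silent intervals means the peer is
    // declared dead and higher-level logic can react (e.g. mark the
    // TaskManager as lost).
    class HeartbeatMonitor(peer: ActorRef, interval: FiniteDuration, maxMissed: Int)
        extends Actor {
      import context.dispatcher
      private var missed = 0
      private val ticker: Cancellable =
        context.system.scheduler.schedule(interval, interval, self, CheckHeartbeat)

      def receive: Receive = {
        case CheckHeartbeat =>
          missed += 1
          if (missed > maxMissed) context.stop(self) // peer considered dead
          else peer ! Ping
        case Pong => missed = 0
      }

      override def postStop(): Unit = ticker.cancel()
    }

As Till notes above, a fixed interval and miss count is simpler than phi accrual but reacts worse to varying network latency.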