release of task slot

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

release of task slot

Radu Tudoran

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: +49 15209084330

Telephone: +49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Till Rohrmann

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till


On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 


Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Gyula Fóra-2
Hey,

I am actually facing a similar issue lately, where the job manager release the task slots as it cannot contact the taskmanager.

Meanwhile the taskmanager is also trying to connect to the Jobmanager and fails multiple times. This happens on multiple taskmanagers seemingly randomly. So the TM stays alive but the connection is lost.

Maybe these are related. We are currently trying to debug this problem.

Gyula

Till Rohrmann <[hidden email]> ezt írta (időpont: 2016. febr. 4., Cs, 15:55):

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till


On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

image001.png

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 


Reply | Threaded
Open this post in threaded view
|

RE: release of task slot

Radu Tudoran
In reply to this post by Till Rohrmann

Hi,

 

Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file.

 

 

akka.actor.ActorInitializationException: exception during creation

        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

        at akka.actor.ActorCell.create(ActorCell.scala:596)

        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)

        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)

        at akka.dispatch.Mailbox.run(Mailbox.scala:220)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

        at akka.util.Reflect$.instantiate(Reflect.scala:66)

        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)

        at akka.actor.Props.newActor(Props.scala:252)

        at akka.actor.ActorCell.newActor(ActorCell.scala:552)

        at akka.actor.ActorCell.create(ActorCell.scala:578)

        ... 10 more

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task resource cleanup

java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task exception handler

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

….

 

- Unexpected exception in the selector loop.

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

 

Looks like the input flow is faster than the GC collector

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: +49 15209084330

Telephone: +49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, February 04, 2016 4:55 PM
To: [hidden email]
Subject: Re: release of task slot

 

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till

 

On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

 

Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Stephan Ewen
@Gyula Do you see log messages about quarantined actor systems?

There may be an issue with Akka Death watches that once the connection is lost, it cannot be re-established unless the TaskManager is restarted




On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <[hidden email]> wrote:

Hi,

 

Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file.

 

 

akka.actor.ActorInitializationException: exception during creation

        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

        at akka.actor.ActorCell.create(ActorCell.scala:596)

        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)

        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)

        at akka.dispatch.Mailbox.run(Mailbox.scala:220)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

        at akka.util.Reflect$.instantiate(Reflect.scala:66)

        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)

        at akka.actor.Props.newActor(Props.scala:252)

        at akka.actor.ActorCell.newActor(ActorCell.scala:552)

        at akka.actor.ActorCell.create(ActorCell.scala:578)

        ... 10 more

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task resource cleanup

java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task exception handler

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

….

 

- Unexpected exception in the selector loop.

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

 

Looks like the input flow is faster than the GC collector

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, February 04, 2016 4:55 PM
To: [hidden email]
Subject: Re: release of task slot

 

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till

 

On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

 


Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Gyula Fóra
Yes exactly , it says it is quarantined.

Gyula


Gyula
On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <[hidden email]> wrote:
@Gyula Do you see log messages about quarantined actor systems?

There may be an issue with Akka Death watches that once the connection is lost, it cannot be re-established unless the TaskManager is restarted




On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <[hidden email]> wrote:

Hi,

 

Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file.

 

 

akka.actor.ActorInitializationException: exception during creation

        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

        at akka.actor.ActorCell.create(ActorCell.scala:596)

        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)

        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)

        at akka.dispatch.Mailbox.run(Mailbox.scala:220)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

        at akka.util.Reflect$.instantiate(Reflect.scala:66)

        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)

        at akka.actor.Props.newActor(Props.scala:252)

        at akka.actor.ActorCell.newActor(ActorCell.scala:552)

        at akka.actor.ActorCell.create(ActorCell.scala:578)

        ... 10 more

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task resource cleanup

java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task exception handler

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

….

 

- Unexpected exception in the selector loop.

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

 

Looks like the input flow is faster than the GC collector

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, February 04, 2016 4:55 PM
To: [hidden email]
Subject: Re: release of task slot

 

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till

 

On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

 



image001.png (19K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Stephan Ewen
Okay, here are the docs for the Akka version we are using: http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model

It says that after a remote deathwatch trigger, the actor system must be restarted before it can connect again.

We probably need to do the following:
  - Either restart TaskManager actor system when it detects that it is quarantined (or when it senses the JobManager as failed)

  - Or switch from Akka death watch to a manual heartbeat mechanism

Would be good to also have Till's input on this...

What do you think?

Stephan



On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <[hidden email]> wrote:
Yes exactly , it says it is quarantined.

Gyula


Gyula

On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <[hidden email]> wrote:
@Gyula Do you see log messages about quarantined actor systems?

There may be an issue with Akka Death watches that once the connection is lost, it cannot be re-established unless the TaskManager is restarted




On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <[hidden email]> wrote:

Hi,

 

Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file.

 

 

akka.actor.ActorInitializationException: exception during creation

        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

        at akka.actor.ActorCell.create(ActorCell.scala:596)

        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)

        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)

        at akka.dispatch.Mailbox.run(Mailbox.scala:220)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

        at akka.util.Reflect$.instantiate(Reflect.scala:66)

        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)

        at akka.actor.Props.newActor(Props.scala:252)

        at akka.actor.ActorCell.newActor(ActorCell.scala:552)

        at akka.actor.ActorCell.create(ActorCell.scala:578)

        ... 10 more

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task resource cleanup

java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task exception handler

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

….

 

- Unexpected exception in the selector loop.

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

 

Looks like the input flow is faster than the GC collector

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, February 04, 2016 4:55 PM
To: [hidden email]
Subject: Re: release of task slot

 

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till

 

On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

 



Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Stephan Ewen
We should probably add to the TaskManager a "restart on quarantined" strategy anyways.



On Thu, Feb 4, 2016 at 5:18 PM, Stephan Ewen <[hidden email]> wrote:
Okay, here are the docs for the Akka version we are using: http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model

It says that after a remote deathwatch trigger, the actor system must be restarted before it can connect again.

We probably need to do the following:
  - Either restart TaskManager actor system when it detects that it is quarantined (or when it senses the JobManager as failed)

  - Or switch from Akka death watch to a manual heartbeat mechanism

Would be good to also have Till's input on this...

What do you think?

Stephan



On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <[hidden email]> wrote:
Yes exactly , it says it is quarantined.

Gyula


Gyula

On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <[hidden email]> wrote:
@Gyula Do you see log messages about quarantined actor systems?

There may be an issue with Akka Death watches that once the connection is lost, it cannot be re-established unless the TaskManager is restarted




On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <[hidden email]> wrote:

Hi,

 

Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file.

 

 

akka.actor.ActorInitializationException: exception during creation

        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

        at akka.actor.ActorCell.create(ActorCell.scala:596)

        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)

        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)

        at akka.dispatch.Mailbox.run(Mailbox.scala:220)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

        at akka.util.Reflect$.instantiate(Reflect.scala:66)

        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)

        at akka.actor.Props.newActor(Props.scala:252)

        at akka.actor.ActorCell.newActor(ActorCell.scala:552)

        at akka.actor.ActorCell.create(ActorCell.scala:578)

        ... 10 more

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task resource cleanup

java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task exception handler

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

….

 

- Unexpected exception in the selector loop.

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

 

Looks like the input flow is faster than the GC collector

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, February 04, 2016 4:55 PM
To: [hidden email]
Subject: Re: release of task slot

 

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till

 

On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

 




Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Gyula Fóra
I am not an expert on this but I guess either could work. What is also interesting that sometimes this doesn't happen for a day sometimes it happens twice an hour so it's probably some network issue as well.

Gyula


Stephan Ewen <[hidden email]> ezt írta (időpont: 2016. febr. 4., Cs, 16:32):
We should probably add to the TaskManager a "restart on quarantined" strategy anyways.



On Thu, Feb 4, 2016 at 5:18 PM, Stephan Ewen <[hidden email]> wrote:
Okay, here are the docs for the Akka version we are using: http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model

It says that after a remote deathwatch trigger, the actor system must be restarted before it can connect again.

We probably need to do the following:
  - Either restart TaskManager actor system when it detects that it is quarantined (or when it senses the JobManager as failed)

  - Or switch from Akka death watch to a manual heartbeat mechanism

Would be good to also have Till's input on this...

What do you think?

Stephan



On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <[hidden email]> wrote:
Yes exactly , it says it is quarantined.

Gyula


Gyula

On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <[hidden email]> wrote:
@Gyula Do you see log messages about quarantined actor systems?

There may be an issue with Akka Death watches that once the connection is lost, it cannot be re-established unless the TaskManager is restarted




On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <[hidden email]> wrote:

Hi,

 

Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file.

 

 

akka.actor.ActorInitializationException: exception during creation

        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

        at akka.actor.ActorCell.create(ActorCell.scala:596)

        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)

        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)

        at akka.dispatch.Mailbox.run(Mailbox.scala:220)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

        at akka.util.Reflect$.instantiate(Reflect.scala:66)

        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)

        at akka.actor.Props.newActor(Props.scala:252)

        at akka.actor.ActorCell.newActor(ActorCell.scala:552)

        at akka.actor.ActorCell.create(ActorCell.scala:578)

        ... 10 more

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task resource cleanup

java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task exception handler

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

….

 

- Unexpected exception in the selector loop.

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

 

Looks like the input flow is faster than the GC collector

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, February 04, 2016 4:55 PM
To: [hidden email]
Subject: Re: release of task slot

 

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till

 

On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

 




Reply | Threaded
Open this post in threaded view
|

Re: release of task slot

Till Rohrmann

I agree with Stephan. Once a TaskManager is quarantined the ActorSystem has to be restarted in order to connect back to the JobManager.

Adding our own heartbeat mechanism should be fairly easy. However, we would lose the more elaborate phi accrual failure detector which can react to changing network conditions. But I’m not sure how much difference this makes in reality.

Listening to the QuarantinedEvent and shutting down the TM should be easy to do. If I remember correctly, we always wanted to add a restart loop to the task manager start script in case that the TM exited with an error.

Maybe we first do the latter option and then re-add our own heartbeats again.

Cheers,
Till


On Fri, Feb 5, 2016 at 9:27 AM, Gyula Fóra <[hidden email]> wrote:
I am not an expert on this but I guess either could work. What is also interesting that sometimes this doesn't happen for a day sometimes it happens twice an hour so it's probably some network issue as well.

Gyula


Stephan Ewen <[hidden email]> ezt írta (időpont: 2016. febr. 4., Cs, 16:32):
We should probably add to the TaskManager a "restart on quarantined" strategy anyways.



On Thu, Feb 4, 2016 at 5:18 PM, Stephan Ewen <[hidden email]> wrote:
Okay, here are the docs for the Akka version we are using: http://doc.akka.io/docs/akka/2.3.14/scala/remoting.html#Lifecycle_and_Failure_Recovery_Model

It says that after a remote deathwatch trigger, the actor system must be restarted before it can connect again.

We probably need to do the following:
  - Either restart TaskManager actor system when it detects that it is quarantined (or when it senses the JobManager as failed)

  - Or switch from Akka death watch to a manual heartbeat mechanism

Would be good to also have Till's input on this...

What do you think?

Stephan



On Thu, Feb 4, 2016 at 5:11 PM, Gyula Fóra <[hidden email]> wrote:
Yes exactly , it says it is quarantined.

Gyula


Gyula

On Thu, Feb 4, 2016 at 4:09 PM Stephan Ewen <[hidden email]> wrote:
@Gyula Do you see log messages about quarantined actor systems?

There may be an issue with Akka Death watches that once the connection is lost, it cannot be re-established unless the TaskManager is restarted




On Thu, Feb 4, 2016 at 5:03 PM, Radu Tudoran <[hidden email]> wrote:

Hi,

 

Well…yesterday when I looked into it there was no additional info than the one I have send. Today I reproduced the problem and I could see in the log file.

 

 

akka.actor.ActorInitializationException: exception during creation

        at akka.actor.ActorInitializationException$.apply(Actor.scala:164)

        at akka.actor.ActorCell.create(ActorCell.scala:596)

        at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)

        at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)

        at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)

        at akka.dispatch.Mailbox.run(Mailbox.scala:220)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Caused by: java.lang.reflect.InvocationTargetException

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)

        at akka.util.Reflect$.instantiate(Reflect.scala:66)

        at akka.actor.ArgsReflectConstructor.produce(Props.scala:352)

        at akka.actor.Props.newActor(Props.scala:252)

        at akka.actor.ActorCell.newActor(ActorCell.scala:552)

        at akka.actor.ActorCell.create(ActorCell.scala:578)

        ... 10 more

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:17,423 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task resource cleanup

java.lang.OutOfMemoryError: GC overhead limit exceeded

11:21:55,160 ERROR org.apache.flink.runtime.taskmanager.Task                                                                                                                                                    - FATAL - exception in task exception handler

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

….

 

- Unexpected exception in the selector loop.

java.lang.OutOfMemoryError: GC overhead limit exceeded

 

 

Looks like the input flow is faster than the GC collector

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" value="+4915209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" value="+49891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

 

From: Till Rohrmann [mailto:[hidden email]]
Sent: Thursday, February 04, 2016 4:55 PM
To: [hidden email]
Subject: Re: release of task slot

 

Hi Radu,

what does the log of the TaskManager 10.204.62.80:57910 say?

Cheers,
Till

 

On Wed, Feb 3, 2016 at 6:00 PM, Radu Tudoran <[hidden email]> wrote:

Hello,

 

 

I am facing an error which for which I cannot figure the cause. Any idea what could cause such an error?

 

 

 

java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager a8b69bd9449ee6792e869a9ff9e843e2 @ cloudr6-admin - 4 slots - URL: akka.tcp://flink@10.204.62.80:57910/user/taskmanager

        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)

        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)

        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)

        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)

        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)

        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:696)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)

        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)

        at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)

        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)

        at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)

        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)

        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)

        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)

        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)

        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)

        at akka.actor.ActorCell.invoke(ActorCell.scala:486)

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)

        at akka.dispatch.Mailbox.run(Mailbox.scala:221)

        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)

        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)

        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)

        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)

        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 

 

Dr. Radu Tudoran

Research Engineer - Big Data Expert

IT R&D Division

 

cid:image007.jpg@01CD52EB.AD060EE0

HUAWEI TECHNOLOGIES Duesseldorf GmbH

European Research Center

Riesstrasse 25, 80992 München

 

E-mail: [hidden email]

Mobile: <a href="tel:%2B49%2015209084330" target="_blank">+49 15209084330

Telephone: <a href="tel:%2B49%20891588344173" target="_blank">+49 891588344173

 

HUAWEI TECHNOLOGIES Duesseldorf GmbH
Hansaallee 205, 40549 Düsseldorf, Germany,
www.huawei.com
Registered Office: Düsseldorf, Register Court Düsseldorf, HRB 56063,
Managing Director: Bo PENG, Wanzhou MENG, Lifang CHEN
Sitz der Gesellschaft: Düsseldorf, Amtsgericht Düsseldorf, HRB 56063,
Geschäftsführer: Bo PENG, Wanzhou MENG, Lifang CHEN

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!