The slot in which the task was scheduled has been killed (probably loss of TaskManager)

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Flavio Pompermaier
Hi to all,

I have this strange error in my job and I don't know what's going on.
What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Reply | Threaded
Open this post in threaded view
|

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Stephan Ewen

This means that the TaskManager was lost. The JobManager can no longer reach the TaskManager and consists all tasks executing ob the TaskManager as failed.

Have a look at the TaskManager log, it should describe why the TaskManager failed.

Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <[hidden email]>:
Hi to all,

I have this strange error in my job and I don't know what's going on.
What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Reply | Threaded
Open this post in threaded view
|

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Flavio Pompermaier
Yes , sorry for that..I found it somewhere in the logs..the problem was that the program didn't die immediately but was somehow hanging and I discovered the source of the problem only running the program on a subset of the data.

Thnks for the support,
Flavio

On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <[hidden email]> wrote:

This means that the TaskManager was lost. The JobManager can no longer reach the TaskManager and consists all tasks executing ob the TaskManager as failed.

Have a look at the TaskManager log, it should describe why the TaskManager failed.

Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <[hidden email]>:
Hi to all,

I have this strange error in my job and I don't know what's going on.
What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Reply | Threaded
Open this post in threaded view
|

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Alexander Alexandrov
I witnessed a similar issue yesterday on a simple job (single task chain, no shuffles) with a release-0.9 based fork.

2015-04-15 14:59 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Yes , sorry for that..I found it somewhere in the logs..the problem was that the program didn't die immediately but was somehow hanging and I discovered the source of the problem only running the program on a subset of the data.

Thnks for the support,
Flavio

On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <[hidden email]> wrote:

This means that the TaskManager was lost. The JobManager can no longer reach the TaskManager and consists all tasks executing ob the TaskManager as failed.

Have a look at the TaskManager log, it should describe why the TaskManager failed.

Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <[hidden email]>:
Hi to all,

I have this strange error in my job and I don't know what's going on.
What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



Reply | Threaded
Open this post in threaded view
|

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Andra Lungu
Something similar in flink-0.10-SNAPSHOT:

06/29/2015 10:33:46     CHAIN Join(Join at main(TriangleCount.java:79)) -> Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to FAILED
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
        at akka.dispatch.Mailbox.run(Mailbox.scala:221)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

06/29/2015 10:33:46     Job execution switched to status FAILING.


On Mon, Jun 29, 2015 at 1:08 PM, Alexander Alexandrov <[hidden email]> wrote:
I witnessed a similar issue yesterday on a simple job (single task chain, no shuffles) with a release-0.9 based fork.

2015-04-15 14:59 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Yes , sorry for that..I found it somewhere in the logs..the problem was that the program didn't die immediately but was somehow hanging and I discovered the source of the problem only running the program on a subset of the data.

Thnks for the support,
Flavio

On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <[hidden email]> wrote:

This means that the TaskManager was lost. The JobManager can no longer reach the TaskManager and consists all tasks executing ob the TaskManager as failed.

Have a look at the TaskManager log, it should describe why the TaskManager failed.

Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <[hidden email]>:
Hi to all,

I have this strange error in my job and I don't know what's going on.
What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




Reply | Threaded
Open this post in threaded view
|

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Till Rohrmann
Do you have the JobManager and TaskManager logs of the corresponding TM, by any chance?

On Mon, Jun 29, 2015 at 8:12 PM, Andra Lungu <[hidden email]> wrote:
Something similar in flink-0.10-SNAPSHOT:

06/29/2015 10:33:46     CHAIN Join(Join at main(TriangleCount.java:79)) -> Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to FAILED
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
        at akka.dispatch.Mailbox.run(Mailbox.scala:221)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

06/29/2015 10:33:46     Job execution switched to status FAILING.


On Mon, Jun 29, 2015 at 1:08 PM, Alexander Alexandrov <[hidden email]> wrote:
I witnessed a similar issue yesterday on a simple job (single task chain, no shuffles) with a release-0.9 based fork.

2015-04-15 14:59 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Yes , sorry for that..I found it somewhere in the logs..the problem was that the program didn't die immediately but was somehow hanging and I discovered the source of the problem only running the program on a subset of the data.

Thnks for the support,
Flavio

On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <[hidden email]> wrote:

This means that the TaskManager was lost. The JobManager can no longer reach the TaskManager and consists all tasks executing ob the TaskManager as failed.

Have a look at the TaskManager log, it should describe why the TaskManager failed.

Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <[hidden email]>:
Hi to all,

I have this strange error in my job and I don't know what's going on.
What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)





Reply | Threaded
Open this post in threaded view
|

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Andra Lungu
Hey Till,

I managed to reproduce the bug; The logs are in the corresponding JIRA [hopefully I got the right ones :)]: FLINK-2299

As a side line. Guys, these two issues (FLINK-2299 and FLINK-2293) are pretty much show stoppers for my work; this means that I can offer support to get them fixed: i.e. modify parameter x on the cluster; try running it now etc.

And I would appreciate it if someone can look into them and pass me a status update.

Thanks a lot! Sorry for the inconvenience!
Andra

On Tue, Jun 30, 2015 at 10:51 AM, Till Rohrmann <[hidden email]> wrote:
Do you have the JobManager and TaskManager logs of the corresponding TM, by any chance?

On Mon, Jun 29, 2015 at 8:12 PM, Andra Lungu <[hidden email]> wrote:
Something similar in flink-0.10-SNAPSHOT:

06/29/2015 10:33:46     CHAIN Join(Join at main(TriangleCount.java:79)) -> Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to FAILED
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
        at akka.dispatch.Mailbox.run(Mailbox.scala:221)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

06/29/2015 10:33:46     Job execution switched to status FAILING.


On Mon, Jun 29, 2015 at 1:08 PM, Alexander Alexandrov <[hidden email]> wrote:
I witnessed a similar issue yesterday on a simple job (single task chain, no shuffles) with a release-0.9 based fork.

2015-04-15 14:59 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Yes , sorry for that..I found it somewhere in the logs..the problem was that the program didn't die immediately but was somehow hanging and I discovered the source of the problem only running the program on a subset of the data.

Thnks for the support,
Flavio

On Wed, Apr 15, 2015 at 2:56 PM, Stephan Ewen <[hidden email]> wrote:

This means that the TaskManager was lost. The JobManager can no longer reach the TaskManager and consists all tasks executing ob the TaskManager as failed.

Have a look at the TaskManager log, it should describe why the TaskManager failed.

Am 15.04.2015 14:45 schrieb "Flavio Pompermaier" <[hidden email]>:
Hi to all,

I have this strange error in my job and I don't know what's going on.
What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)