(DEPRECATED) Apache Flink User Mailing List archive.

The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Flavio Pompermaier

The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Hi to all,

I have this strange error in my job and I don't know what's going on.

What can I do?

The full exception is:

The slot in which the task was scheduled has been killed (probably loss of TaskManager).
        at org.apache.flink.runtime.instance.SimpleSlot.cancel(SimpleSlot.java:98)
        at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSimpleSlot(SlotSharingGroupAssignment.java:335)
        at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:319)
        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:106)
        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:151)
        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:435)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
        at akka.dispatch.Mailbox.run(Mailbox.scala:221)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Stephan Ewen

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

This means that the TaskManager was lost. The JobManager can no longer reach the TaskManager and considers all tasks executing on the TaskManager as failed.

Have a look at the TaskManager log; it should describe why the TaskManager failed.
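
As a rough illustration of that advice, the following Python sketch scans the text of a TaskManager log for lines that often point to the cause of a crash. The pattern list and the sample log lines are assumptions made up for this example, not actual Flink output; adjust them to whatever your log really contains.

```python
import re

# Assumed patterns: typical signs of a dying JVM process. Not an exhaustive
# or Flink-specific list.
SUSPICIOUS = re.compile(r"OutOfMemoryError|\bERROR\b|\bFATAL\b|Exception|Killed")


def suspicious_lines(log_text):
    """Return (line_number, line) pairs for lines matching a pattern."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.strip()))
    return hits


# Invented sample lines, only to show the function working.
sample_log = """\
2015-04-15 14:40:01,120 INFO  worker - started task
2015-04-15 14:44:58,903 ERROR worker - task failed
java.lang.OutOfMemoryError: Java heap space
2015-04-15 14:45:00,001 INFO  worker - shutting down
"""

for number, line in suspicious_lines(sample_log):
    print(number, line)
```

In practice you would read the log file of the TaskManager that was lost (for example with `open(path).read()`) and pass its content to `suspicious_lines`.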

Flavio Pompermaier

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Yes, sorry about that. I found it somewhere in the logs. The problem was that the program didn't die immediately but was somehow hanging, and I discovered the source of the problem only by running the program on a subset of the data.

Thanks for the support,

Flavio

Alexander Alexandrov

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

I witnessed a similar issue yesterday on a simple job (single task chain, no shuffles) with a release-0.9 based fork.

Andra Lungu

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Something similar in flink-0.10-SNAPSHOT:

06/29/2015 10:33:46     CHAIN Join(Join at main(TriangleCount.java:79)) -> Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to FAILED
java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
        at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
        at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
        at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
        at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
        at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
        at akka.dispatch.Mailbox.run(Mailbox.scala:221)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

06/29/2015 10:33:46     Job execution switched to status FAILING.
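
The failure message above already names the lost TaskManager, which tells you whose log to inspect. A minimal Python sketch for pulling those fields out of the message follows. The regular expression is derived only from the single message quoted in this thread, so other Flink versions may format it differently; `parse_lost_taskmanager` is a hypothetical helper name.

```python
import re

# Pattern derived from the one message quoted in this thread (assumption:
# the format may differ between Flink versions).
LOST_TM = re.compile(
    r"loss of TaskManager (?P<id>[0-9a-f]+) @ (?P<host>\S+)"
    r" - (?P<slots>\d+) slots - URL: (?P<url>\S+)"
)


def parse_lost_taskmanager(message):
    """Return id, host, slots and URL of the lost TaskManager, or None."""
    match = LOST_TM.search(message)
    if match is None:
        return None
    info = match.groupdict()
    info["slots"] = int(info["slots"])
    return info


message = (
    "java.lang.Exception: The slot in which the task was executed has been "
    "released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 "
    "@ wally025 - 8 slots - URL: "
    "akka.tcp://flink@130.149.249.35:56135/user/taskmanager"
)

print(parse_lost_taskmanager(message))
```

Here the host field (`wally025`) is the machine whose TaskManager log is worth reading first.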

Till Rohrmann

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Do you have the JobManager and TaskManager logs of the corresponding TM, by any chance?
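
When collecting those logs, it can help to cut them down to the minutes around the failure. The sketch below keeps only log lines whose timestamp lies within a window around the failure time reported by the client (`06/29/2015 10:33:46` in the output above). It assumes that log lines start with a `yyyy-MM-dd HH:mm:ss` timestamp; that layout depends on the logging configuration, and the sample lines are invented for illustration.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, failure_time, window_seconds=60):
    """Keep lines whose leading timestamp is within the window."""
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in log_lines:
        try:
            # Assumed layout: "2015-06-29 10:33:40,000 LEVEL ..."
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # lines without a timestamp (e.g. stack frames) are skipped
        if abs(stamp - failure_time) <= window:
            selected.append(line)
    return selected


# Failure time in the format printed by the client output in this thread.
failure = datetime.strptime("06/29/2015 10:33:46", "%m/%d/%Y %H:%M:%S")

# Invented sample lines, not real Flink output.
sample = [
    "2015-06-29 09:00:00,000 INFO  worker - started",
    "2015-06-29 10:33:40,000 WARN  worker - heartbeat late",
    "        at some.Frame(Frame.java:1)",
]

print(lines_near(sample, failure))
```

A wider `window_seconds` is a reasonable choice if the TaskManager was hanging for a while before it was declared lost.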

Andra Lungu

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

Hey Till,

I managed to reproduce the bug. The logs are in the corresponding JIRA issue (hopefully I got the right ones :)): FLINK-2299

As a side note: guys, these two issues (FLINK-2299 and FLINK-2293) are pretty much show stoppers for my work, so I can offer support to get them fixed, e.g. modify parameter x on the cluster, try running it again, etc.

And I would appreciate it if someone could look into them and give me a status update.

Thanks a lot! Sorry for the inconvenience!

Andra
