Loss of TaskManager Error

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Loss of TaskManager Error

Debaditya Roy
Hello users,

I was running an experiment on a very simple cluster with two nodes (one jobmanager and another taskmanager). However after starting the execution, in a few seconds the program is aborted with the error.

The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
    at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
    at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
    at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:65)
    at org.myorg.quickstart.Job.main(Job.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:505)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
    at org.apache.flink.client.program.Client.runBlocking(Client.java:248)
    at org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:866)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1192)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1243)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:717)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 6d2c9d29eddb2a1497827217f4d9a6d1 @ parapluie-28 - 1 slots - URL: akka.tcp://flink@172.16.99.28:60365/user/taskmanager
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:850)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:107)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    ... 4 more

Can anyone having similar problem solved this issue. I am attaching the logs also. One of the logs say that the taskmanager ip is unreachable (but I have successfully pinged and ssh-ed from the job manager to the taskmanager).

Thanks in advance,

P.S: PFA the logs.

Regards,
Debaditya


Error (6K) Download Attachment
attachment1 (6K) Download Attachment
FlinkCommands (930 bytes) Download Attachment
Flink_Root_Client_Log_JM (11K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Loss of TaskManager Error

Till Rohrmann
Hi Debaditya,

could you check what the log of the presumably failed task manager says? It might contain hints to what actually went wrong.

Cheers,
Till

On Mon, Aug 1, 2016 at 9:49 PM, Debaditya Roy <[hidden email]> wrote:
Hello users,

I was running an experiment on a very simple cluster with two nodes (one jobmanager and another taskmanager). However after starting the execution, in a few seconds the program is aborted with the error.

The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
    at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
    at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
    at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:65)
    at org.myorg.quickstart.Job.main(Job.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:505)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
    at org.apache.flink.client.program.Client.runBlocking(Client.java:248)
    at org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:866)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1192)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1243)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:717)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 6d2c9d29eddb2a1497827217f4d9a6d1 @ parapluie-28 - 1 slots - URL: akka.tcp://flink@172.16.99.28:60365/user/taskmanager
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:850)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:107)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    ... 4 more

Can anyone having similar problem solved this issue. I am attaching the logs also. One of the logs say that the taskmanager ip is unreachable (but I have successfully pinged and ssh-ed from the job manager to the taskmanager).

Thanks in advance,

P.S: PFA the logs.

Regards,
Debaditya


Reply | Threaded
Open this post in threaded view
|

Re: Loss of TaskManager Error

Debaditya Roy
Hi Till,

Thanks for the input. The error was in a training set which I found in the .out file of the taskmanager. I corrected that and I am getting some results.

Thanks and Regards,
Debaditya

On Mon, Aug 1, 2016 at 3:54 PM, Till Rohrmann <[hidden email]> wrote:
Hi Debaditya,

could you check what the log of the presumably failed task manager says? It might contain hints to what actually went wrong.

Cheers,
Till

On Mon, Aug 1, 2016 at 9:49 PM, Debaditya Roy <[hidden email]> wrote:
Hello users,

I was running an experiment on a very simple cluster with two nodes (one jobmanager and another taskmanager). However after starting the execution, in a few seconds the program is aborted with the error.

The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
    at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
    at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
    at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:65)
    at org.myorg.quickstart.Job.main(Job.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:505)
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
    at org.apache.flink.client.program.Client.runBlocking(Client.java:248)
    at org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:866)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1192)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1243)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:717)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 6d2c9d29eddb2a1497827217f4d9a6d1 @ parapluie-28 - 1 slots - URL: akka.tcp://flink@172.16.99.28:60365/user/taskmanager
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:153)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:850)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:107)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    ... 4 more

Can anyone having similar problem solved this issue. I am attaching the logs also. One of the logs say that the taskmanager ip is unreachable (but I have successfully pinged and ssh-ed from the job manager to the taskmanager).

Thanks in advance,

P.S: PFA the logs.

Regards,
Debaditya