Flink fails continuously: Couldn't retrieve the JobExecutionResult from the JobManager.

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink fails continuously: Couldn't retrieve the JobExecutionResult from the JobManager.

Marke Builder
Hi,

my flink job fails continously(sometimes behind minutes, sometimes behind hours) with the 
follwing exception. 

Flink run configuration:
run with yarn: -yn 2 -ys 5 -yjm 8192 -ymt 12288
streaming-job: kafka source and redis sink


The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
        at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:215)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
        at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
        at com.voith.cloud.app.timeseries.fasttrack.StreamToCache.main(StreamToCache.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:525)
        at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:417)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:396)
        at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
        at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
        ... 21 more
Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.
        at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




and: 



Reply | Threaded
Open this post in threaded view
|

Re: Flink fails continuously: Couldn't retrieve the JobExecutionResult from the JobManager.

vino yang
Hi Marke,

Are you expecting your job to quickly return the results of the stream calculation? 
If it is running for a long time, you can run it in detached mode when you submit the job[1]. 
It will not cause your client to be blocked and stay connected to the Flink JobManager.

Thanks, vino.


Marke Builder <[hidden email]> 于2018年10月19日周五 下午12:41写道:
Hi,

my flink job fails continously(sometimes behind minutes, sometimes behind hours) with the 
follwing exception. 

Flink run configuration:
run with yarn: -yn 2 -ys 5 -yjm 8192 -ymt 12288
streaming-job: kafka source and redis sink


The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
        at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:215)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
        at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
        at com.voith.cloud.app.timeseries.fasttrack.StreamToCache.main(StreamToCache.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:525)
        at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:417)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:396)
        at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
        at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
        ... 21 more
Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.
        at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




and: 



Reply | Threaded
Open this post in threaded view
|

Re: Flink fails continuously: Couldn't retrieve the JobExecutionResult from the JobManager.

vino yang
Hi marke,

My advice is not to keep your client connected to JM. 
If you expect continuous output, you can sink it out. 
In addition, it does not rule out that your JM load is too high, such as the emergence of full GC and so on. 
So, make sure your JM has enough resources to use and monitor it.

Thanks, vino.

Marke Builder <[hidden email]> 于2018年11月1日周四 上午7:25写道:
Hi Vino,
yes I'm expecting a quickly result of the stream (parse, normalize and store).
Thanks for the tip with the detached mode, the job is now a bit more stable, but sometimes it failed with the same issue.
Any other proposals ?

Thanks, marke.

Am Mo., 22. Okt. 2018 um 04:04 Uhr schrieb vino yang <[hidden email]>:
Hi Marke,

Are you expecting your job to quickly return the results of the stream calculation? 
If it is running for a long time, you can run it in detached mode when you submit the job[1]. 
It will not cause your client to be blocked and stay connected to the Flink JobManager.

Thanks, vino.


Marke Builder <[hidden email]> 于2018年10月19日周五 下午12:41写道:
Hi,

my flink job fails continously(sometimes behind minutes, sometimes behind hours) with the 
follwing exception. 

Flink run configuration:
run with yarn: -yn 2 -ys 5 -yjm 8192 -ymt 12288
streaming-job: kafka source and redis sink


The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
        at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:215)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
        at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
        at com.voith.cloud.app.timeseries.fasttrack.StreamToCache.main(StreamToCache.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:525)
        at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:417)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:396)
        at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
        at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
        at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
        ... 21 more
Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.
        at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
        at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




and: