(DEPRECATED) Apache Flink User Mailing List archive.

Flink Job Deployment (Not enough resources)

Rinat

Flink Job Deployment (Not enough resources)

Hi everyone, I've run into the following problem: when I try to submit a new job and the cluster does not have enough resources, the job submission fails with the exception below.

In YARN, however, the job hangs and waits for the requested resources. When resources become available, the job runs successfully.

What can I do to make sure that job startup either completes successfully or fails completely?

Thx.

The program finished with the following exception:

java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised
    at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)
    at org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
    at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
    at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
    at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
    at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
    at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)
Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
    at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)
    at org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)
    ... 15 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)
    at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)
    ... 16 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at scala.concurrent.Await.result(package.scala)
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)
    ... 17 more

Fabian Hueske-2

Re: Flink Job Deployment (Not enough resources)

Hi Rinat,

No, Flink does not have a switch to immediately cancel a job if it cannot allocate enough resources.

Maybe YARN has a configuration parameter that defines a timeout after which a job is canceled if no resources become available.
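If no such YARN setting fits, one possible workaround is to enforce the timeout from the submitting side. The following is only a minimal sketch, not something Flink provides: it assumes the `yarn` CLI is on the PATH, that `yarn application -status <id>` prints a line of the form `State : ACCEPTED`, and that the application ID is already known. The names `wait_until_running`, `yarn_state` and `yarn_kill` are hypothetical helpers introduced here for illustration.

```python
import re
import subprocess
import time

# YARN application states from which no further progress is expected.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def yarn_state(app_id):
    """Return the application state parsed from `yarn application -status`.

    Assumption: the report contains a line such as "State : RUNNING".
    """
    out = subprocess.run(
        ["yarn", "application", "-status", app_id],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"^\s*State\s*:\s*(\w+)", out, re.MULTILINE)
    if match is None:
        raise RuntimeError("could not find application state in report")
    return match.group(1)


def yarn_kill(app_id):
    """Ask YARN to kill the application."""
    subprocess.run(["yarn", "application", "-kill", app_id], check=True)


def wait_until_running(app_id, timeout_s=300.0, poll_s=5.0,
                       get_state=yarn_state, kill=yarn_kill,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until the application is RUNNING; kill it once the deadline passes.

    Returns True if the application reached RUNNING, False if it ended in a
    terminal state or was killed here because the timeout expired.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state(app_id)
        if state == "RUNNING":
            return True
        if state in TERMINAL_STATES:
            return False
        if clock() >= deadline:
            kill(app_id)
            return False
        sleep(poll_s)
```

A deployment script could call `wait_until_running("<application id>", timeout_s=120)` after submission and treat a `False` result as a failed startup. Note that RUNNING only indicates that YARN has started the application master; it does not by itself guarantee that every requested TaskManager container was allocated.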

In reply to Rinat's message of 2017-09-04 13:29 GMT+02:00.