Hi everyone, I’ve got the following problem, when I’m trying to submit new job and if cluster has not enough resources, job submission fails with the following exception
But in YARN job hangs and wait’s for requested resources. When resources become available, job successfully runs. What can I do to be sure that job startup is completed successfully or completely failed ? Thx. The program finished with the following exception:\n\njava.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised\n\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)\n\tat org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)\n\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)\n\tat org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)\n\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)\n\tat org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)\n\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)\n\tat org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)\n\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)\n\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)\n\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)\n\tat org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)\n\tat org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)\nCaused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.\n\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)\n\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)\n\t... 15 more\nCaused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.\n\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)\n\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)\n\t... 16 more\nCaused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]\n\tat scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)\n\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)\n\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)\n\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)\n\tat scala.concurrent.Await$.result(package.scala:190)\n\tat scala.concurrent.Await.result(package.scala)\n\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)\n\t... 17 more", "stderr_lines": ["", "------------------------------------------------------------", " The program finished with the following exception:", "", "java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised", "\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)", "\tat org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)", "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)", "\tat org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)", "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)", "\tat org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)", "\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)", "\tat org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)", "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)", "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)", "\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)", "\tat java.security.AccessController.doPrivileged(Native Method)", "\tat javax.security.auth.Subject.doAs(Subject.java:422)", "\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)", "\tat org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)", "\tat org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)", "Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.", "\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)", "\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)", "\t... 15 more", "Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.", "\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)", "\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)", "\t... 16 more", "Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]", "\tat scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)", "\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)", "\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)", "\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)"
|
Hi Rinat, No, Flink does not have a switch to immediately cancel a job if it cannot allocate enough resources.2017-09-04 13:29 GMT+02:00 Rinat <[hidden email]>:
|
Free forum by Nabble | Edit this page |