submit job failed on Yarn HA

submit job failed on Yarn HA

sen
Hi all:

        I run Flink (1.5.1 with Hadoop 2.7) on YARN and submit a job with "/usr/local/flink/bin/flink run -m jmhost:port my.jar", but the submission fails.
The HA configuration is:
  •  high-availability: zookeeper
  •  high-availability.storageDir: hdfs:///flink/ha/
  •  high-availability.zookeeper.quorum: hdp1:2181,hdp2:2181,hdp3:2181
  •  yarn.application-attempts: 2
The client log shows:

2019-02-27 11:48:38,651 INFO  org.apache.flink.runtime.rest.RestClient                      - Shutting down rest endpoint.
2019-02-27 11:48:38,659 INFO  org.apache.flink.runtime.rest.RestClient                      - Rest endpoint shutdown complete.
2019-02-27 11:48:38,662 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2019-02-27 11:48:38,665 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-02-27 11:48:38,670 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x2679c52880c00ee closed
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x2679c52880c00ee
2019-02-27 11:48:38,690 ERROR org.apache.flink.client.cli.CliFrontend                       - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkMainProcess.process(WormholeFlinkMainProcess.scala:114)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.delayedEndpoint$edp$wormhole$flinkx$eventflow$WormholeFlinkxStarter$1(WormholeFlinkxStarter.scala:40)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$delayedInit$body.apply(WormholeFlinkxStarter.scala:29)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.main(WormholeFlinkxStarter.scala:29)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter.main(WormholeFlinkxStarter.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more


Re:submit job failed on Yarn HA

shengjk1
Hi @孙森,

"/usr/local/flink/bin/flink run -m jmhost:port my.jar" does not submit the job to YARN.
If you want to submit a job to YARN, you should use "/usr/local/flink/bin/flink run -m yarn-cluster my.jar".

Please refer to https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/cli.html
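For example, a per-job submission to YARN would look roughly like the sketch below (the resource values and the jar name are placeholders, not taken from this thread):

# start a dedicated YARN cluster for this single job and submit my.jar to it
/usr/local/flink/bin/flink run -m yarn-cluster -yn 2 -yjm 1024 -ytm 4096 -ys 2 my.jar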

Best,
Shengjk1


On 02/27/2019 15:09, [hidden email] wrote:


Re: Re:submit job failed on Yarn HA

sen
I set up a YARN session cluster with:
./bin/yarn-session.sh -n 10 -tm 8192 -s 32

Then I submit jobs to this cluster. That works fine, and I have used it this way for a long time.
This way you can submit multiple jobs to one cluster, as sketched below.
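A rough sketch of attaching to such a session (just an illustration; <applicationId> is a placeholder for the YARN application id of the running session):

# submit another job to the already running YARN session
# (the session is normally picked up automatically; -yid selects it explicitly)
/usr/local/flink/bin/flink run -yid <applicationId> my.jar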





Re: submit job failed on Yarn HA

Gary Yao-4
In reply to this post by sen
Hi,

How did you determine "jmhost" and "port"? Actually you do not need to specify
these manually. If the client is using the same configuration as your cluster,
the client will look up the leading JM from ZooKeeper.

If you have already tried omitting the "-m" parameter, you can check in the
client logs which host is used for the job submission [1]. Note that you need
to enable logging on DEBUG level.

The root cause in your stacktrace is a TimeoutException. I would debug this by
checking if you can establish a TCP connection – from the machine you are
submitting the job from, to the target host/port [2].

Moreover, you are using a quite dated Flink version. The newest version in the
1.5 major release is 1.5.6 – so consider upgrading to that or even to 1.7.

Best,
Gary

[1] https://github.com/apache/flink/blob/3488f8b144a2127497c39b8ed5a48a65b551c57d/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L185
[2] https://stackoverflow.com/questions/4922943/test-from-shell-script-if-remote-tcp-port-is-open
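As a rough illustration of the two checks above (a sketch only; jmhost and 8081 merely stand in for the actual REST host and port of your YARN application):

# 1) verify from the submitting machine that the JobManager REST endpoint is reachable
nc -zv jmhost 8081

# 2) omit -m and let the client resolve the leading JobManager via ZooKeeper
/usr/local/flink/bin/flink run my.jar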

On Wed, Feb 27, 2019 at 8:09 AM 孙森 <[hidden email]> wrote:


Re: submit job failed on Yarn HA

sen
Hi Gary,

       Actually, I have several Flink clusters on YARN, one for each project. Each project can only submit jobs to its own cluster.
I have already enabled logging on DEBUG level.

How did you determine "jmhost" and "port"?
We do this by requesting the REST API through the active ResourceManager's proxy: http://activeRm/proxy/appId/jars
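For illustration, such a lookup can be issued like this (a sketch; activeRm and appId are the same placeholders as above):

# query the Flink REST API via the active ResourceManager's proxy
curl http://activeRm/proxy/appId/jars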


The full client log is in the mail attachment.





On Feb 27, 2019, at 9:30 PM, Gary Yao <[hidden email]> wrote:




Attachment: flink-client.log (2M)

Re: submit job failed on Yarn HA

sen
In reply to this post by Gary Yao-4
Hi Gary:

           I have tried version 1.5.6, and it shows the same error.

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:258)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:798)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:289)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1035)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1111)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1111)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:371)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:203)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:795)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException


Best!
Sen

On Feb 27, 2019, at 9:30 PM, Gary Yao <[hidden email]> wrote:

Hi,

How did you determine "jmhost" and "port"? Actually you do not need to specify
these manually. If the client is using the same configuration as your cluster,
the client will look up the leading JM from ZooKeeper.

If you have already tried omitting the "-m" parameter, you can check in the
client logs which host is used for the job submission [1]. Note that you need
to enable logging on DEBUG level.

The root cause in your stacktrace is a TimeoutException. I would debug this by
checking if you can establish a TCP connection – from the machine you are
submitting the job from, to the target host/port [2].
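
Instead of the shell test in [2], the same check can be sketched with the plain JDK socket API from Scala; the host, port, and timeout below are placeholders, not values taken from this thread:

import java.net.{InetSocketAddress, Socket}

object CheckTcp {
  def main(args: Array[String]): Unit = {
    val host = "jmhost"        // placeholder: the host you pass to -m
    val port = 8081            // placeholder: the REST/web port you pass to -m
    val timeoutMillis = 5000

    val socket = new Socket()
    try {
      // Succeeds only if a TCP connection can be established within the timeout.
      socket.connect(new InetSocketAddress(host, port), timeoutMillis)
      println(s"$host:$port is reachable")
    } catch {
      case e: Exception => println(s"$host:$port is NOT reachable: ${e.getMessage}")
    } finally {
      socket.close()
    }
  }
}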

Moreover, you are using a quite dated Flink version. The newest version in the
1.5 major release is 1.5.6 – so consider upgrading to that or even to 1.7.

Best,
Gary

[1] https://github.com/apache/flink/blob/3488f8b144a2127497c39b8ed5a48a65b551c57d/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L185
[2] https://stackoverflow.com/questions/4922943/test-from-shell-script-if-remote-tcp-port-is-open

On Wed, Feb 27, 2019 at 8:09 AM 孙森 <[hidden email]> wrote:
Hi all:
        
        I run flink (1.5.1 with hadoop 2.7) on yarn ,and submit job by “/usr/local/flink/bin/flink run -m jmhost:port my.jar”, but the submission is failed.
The HA configuration is :
  •  high-availability: zookeeper
  •  high-availability.storageDir: hdfs:///flink/ha/
  •  high-availability.zookeeper.quorum:  hdp1:2181,hdp2:2181,hdp3:2181
  •  yarn.application-attempts: 2
The info showed int the client log:    

2019-02-27 11:48:38,651 INFO  org.apache.flink.runtime.rest.RestClient                      - Shutting down rest endpoint.
2019-02-27 11:48:38,659 INFO  org.apache.flink.runtime.rest.RestClient                      - Rest endpoint shutdown complete.
2019-02-27 11:48:38,662 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2019-02-27 11:48:38,665 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-02-27 11:48:38,670 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x2679c52880c00ee closed
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x2679c52880c00ee
2019-02-27 11:48:38,690 ERROR org.apache.flink.client.cli.CliFrontend                       - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkMainProcess.process(WormholeFlinkMainProcess.scala:114)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.delayedEndpoint$edp$wormhole$flinkx$eventflow$WormholeFlinkxStarter$1(WormholeFlinkxStarter.scala:40)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$delayedInit$body.apply(WormholeFlinkxStarter.scala:29)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.main(WormholeFlinkxStarter.scala:29)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter.main(WormholeFlinkxStarter.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more



Re: submit job failed on Yarn HA

Gary Yao-4
In reply to this post by sen
Hi Sen,

I took a look at the CLI code again, and found out that -m is ignored if high-
availability: ZOOKEEPER is configured in your flink-conf.yaml. This does not
seem right and should be at least documented [1].

Judging from the client logs that you provided, I think the problem is that
the client cannot resolve the leading JobManager from ZooKeeper [2][3]. You
can try the following things for debugging:

    * Check the contents in the znode
      /flink/[...]/leader/rest_server_lock using the ZK CLI. It should contain the
      address of the leader. If not, I would check the jobmanager logs for related
      errors. (A programmatic sketch of this check follows this list.)
   
    * Submit the job with -m parameter but without ZooKeeper configuration in
      the client's flink-conf.yaml
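
If the ZK CLI is not at hand, the first check can also be done programmatically. Below is a minimal sketch using the plain ZooKeeper Java client from Scala; the quorum string follows the configuration quoted in this thread, and the cluster-id segment of the znode path ("application_XXXX") is only a placeholder for whatever actually sits under /flink on your installation:

import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

object ReadLeaderZnode {
  def main(args: Array[String]): Unit = {
    // Quorum as configured in this thread; adjust if yours differs.
    val zk = new ZooKeeper("hdp1:2181,hdp2:2181,hdp3:2181", 30000,
      new Watcher { override def process(event: WatchedEvent): Unit = () })
    try {
      // Placeholder path: list /flink first to find the real cluster-id segment.
      val path = "/flink/application_XXXX/leader/rest_server_lock"
      val data = zk.getData(path, false, null)
      // The payload is Java-serialized, but the leader's REST address is readable inside it.
      println(new String(data, "UTF-8"))
    } finally {
      zk.close()
    }
  }
}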

Best,
Gary 

[1] https://issues.apache.org/jira/browse/FLINK-11779
[2] https://github.com/apache/flink/blob/release-1.5.1/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L170
[3] https://github.com/apache/flink/blob/release-1.5.1/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L746-L750

On Thu, Feb 28, 2019 at 4:34 AM 孙森 <[hidden email]> wrote:
Hi, Gary

       Actually, I have several Flink clusters on Yarn, one for each project. Each project can only submit jobs to its specific cluster.
I’ve already enabled logging on DEBUG level.

How did you determine "jmhost" and "port"?
We do this by requesting the REST API: http://activeRm/proxy/appId/jars
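
For reference, a minimal sketch of issuing such a request from Scala; the URL below is only a placeholder following the pattern above (active ResourceManager proxy plus YARN application id), and extracting the host and port from the response is not shown:

import scala.io.Source

object QueryYarnProxy {
  def main(args: Array[String]): Unit = {
    // Placeholder URL: active ResourceManager web address + YARN application id.
    val url = "http://active-rm:8088/proxy/application_XXXX/jars"
    val src = Source.fromURL(url)
    try {
      println(src.mkString) // dump the raw response for inspection
    } finally {
      src.close()
    }
  }
}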


The full client log is in the mail attachment.




On Feb 27, 2019, at 9:30 PM, Gary Yao <[hidden email]> wrote:

Hi,

How did you determine "jmhost" and "port"? Actually you do not need to specify
these manually. If the client is using the same configuration as your cluster,
the client will look up the leading JM from ZooKeeper.

If you have already tried omitting the "-m" parameter, you can check in the
client logs which host is used for the job submission [1]. Note that you need
to enable logging on DEBUG level.

The root cause in your stacktrace is a TimeoutException. I would debug this by
checking if you can establish a TCP connection – from the machine you are
submitting the job from, to the target host/port [2].

Moreover, you are using a quite dated Flink version. The newest version in the
1.5 major release is 1.5.6 – so consider upgrading to that or even to 1.7.

Best,
Gary

[1] https://github.com/apache/flink/blob/3488f8b144a2127497c39b8ed5a48a65b551c57d/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L185
[2] https://stackoverflow.com/questions/4922943/test-from-shell-script-if-remote-tcp-port-is-open

On Wed, Feb 27, 2019 at 8:09 AM 孙森 <[hidden email]> wrote:
Hi all:
        
        I run flink (1.5.1 with hadoop 2.7) on yarn ,and submit job by “/usr/local/flink/bin/flink run -m jmhost:port my.jar”, but the submission is failed.
The HA configuration is :
  •  high-availability: zookeeper
  •  high-availability.storageDir: hdfs:///flink/ha/
  •  high-availability.zookeeper.quorum:  hdp1:2181,hdp2:2181,hdp3:2181
  •  yarn.application-attempts: 2
The info showed int the client log:    

2019-02-27 11:48:38,651 INFO  org.apache.flink.runtime.rest.RestClient                      - Shutting down rest endpoint.
2019-02-27 11:48:38,659 INFO  org.apache.flink.runtime.rest.RestClient                      - Rest endpoint shutdown complete.
2019-02-27 11:48:38,662 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2019-02-27 11:48:38,665 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-02-27 11:48:38,670 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x2679c52880c00ee closed
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x2679c52880c00ee
2019-02-27 11:48:38,690 ERROR org.apache.flink.client.cli.CliFrontend                       - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkMainProcess.process(WormholeFlinkMainProcess.scala:114)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.delayedEndpoint$edp$wormhole$flinkx$eventflow$WormholeFlinkxStarter$1(WormholeFlinkxStarter.scala:40)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$delayedInit$body.apply(WormholeFlinkxStarter.scala:29)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.main(WormholeFlinkxStarter.scala:29)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter.main(WormholeFlinkxStarter.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more


Re: submit job failed on Yarn HA

sen
Hi Gary:
             I checked the znode; the address of the leader was there.



When I removed the ZooKeeper configuration in the client's flink-conf.yaml, the job was submitted successfully.
Then I tried to test whether HA could work. I killed the job manager and it restarted, but the job did not restart when the job manager restarted.


Best!
Sen

On Feb 28, 2019, at 6:59 PM, Gary Yao <[hidden email]> wrote:

Hi Sen,

I took a look at the CLI code again, and found out that -m is ignored if high-
availability: ZOOKEEPER is configured in your flink-conf.yaml. This does not
seem right and should be at least documented [1].

Judging from the client logs that you provided, I think the problem is that
the client cannot resolve the leading JobManager from ZooKeeper [2][3]. You
can try the following things for debugging:

    * Check the contents in the znode
      /flink/[...]/leader/rest_server_lock using the ZK CLI. It should contain the
      address of the leader. If not, I would check the jobmanager logs for related
      errors.
   
    * Submit the job with -m parameter but without ZooKeeper configuration in
      the client's flink-conf.yaml

Best,
Gary 

[1] https://issues.apache.org/jira/browse/FLINK-11779
[2] https://github.com/apache/flink/blob/release-1.5.1/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L170
[3] https://github.com/apache/flink/blob/release-1.5.1/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L746-L750

On Thu, Feb 28, 2019 at 4:34 AM 孙森 <[hidden email]> wrote:
Hi, Gary

       Actually, I have several Flink clusters on Yarn, one for each project. Each project can only submit jobs to its specific cluster.
I’ve already enabled logging on DEBUG level.

How did you determine "jmhost" and "port"?
We do this by requesting the REST API: http://activeRm/proxy/appId/jars


The full client log is in the mail attachment.




On Feb 27, 2019, at 9:30 PM, Gary Yao <[hidden email]> wrote:

Hi,

How did you determine "jmhost" and "port"? Actually you do not need to specify
these manually. If the client is using the same configuration as your cluster,
the client will look up the leading JM from ZooKeeper.

If you have already tried omitting the "-m" parameter, you can check in the
client logs which host is used for the job submission [1]. Note that you need
to enable logging on DEBUG level.

The root cause in your stacktrace is a TimeoutException. I would debug this by
checking if you can establish a TCP connection – from the machine you are
submitting the job from, to the target host/port [2].

Moreover, you are using a quite dated Flink version. The newest version in the
1.5 major release is 1.5.6 – so consider upgrading to that or even to 1.7.

Best,
Gary

[1] https://github.com/apache/flink/blob/3488f8b144a2127497c39b8ed5a48a65b551c57d/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L185
[2] https://stackoverflow.com/questions/4922943/test-from-shell-script-if-remote-tcp-port-is-open

On Wed, Feb 27, 2019 at 8:09 AM 孙森 <[hidden email]> wrote:
Hi all:
        
        I run flink (1.5.1 with hadoop 2.7) on yarn ,and submit job by “/usr/local/flink/bin/flink run -m jmhost:port my.jar”, but the submission is failed.
The HA configuration is :
  •  high-availability: zookeeper
  •  high-availability.storageDir: hdfs:///flink/ha/
  •  high-availability.zookeeper.quorum:  hdp1:2181,hdp2:2181,hdp3:2181
  •  yarn.application-attempts: 2
The info showed int the client log:    

2019-02-27 11:48:38,651 INFO  org.apache.flink.runtime.rest.RestClient                      - Shutting down rest endpoint.
2019-02-27 11:48:38,659 INFO  org.apache.flink.runtime.rest.RestClient                      - Rest endpoint shutdown complete.
2019-02-27 11:48:38,662 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2019-02-27 11:48:38,665 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-02-27 11:48:38,670 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x2679c52880c00ee closed
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x2679c52880c00ee
2019-02-27 11:48:38,690 ERROR org.apache.flink.client.cli.CliFrontend                       - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkMainProcess.process(WormholeFlinkMainProcess.scala:114)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.delayedEndpoint$edp$wormhole$flinkx$eventflow$WormholeFlinkxStarter$1(WormholeFlinkxStarter.scala:40)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$delayedInit$body.apply(WormholeFlinkxStarter.scala:29)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.main(WormholeFlinkxStarter.scala:29)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter.main(WormholeFlinkxStarter.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more




Re: submit job failed on Yarn HA

Gary Yao-4
Hi Sen,

Did you set a restart strategy [1]? If you enabled checkpoints [2], the fixed-
delay strategy will be used by default.

Best,
Gary

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/restart_strategies.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/state/checkpoints.html
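
For illustration, a minimal sketch of setting these explicitly with the Scala DataStream API (which the stack traces above suggest the job uses); the checkpoint interval, attempt count, and delay are arbitrary example values, not recommendations from this thread:

import java.util.concurrent.TimeUnit

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.streaming.api.scala._

object RestartStrategyExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // With checkpointing enabled, fixed-delay restarts are the default,
    // but the strategy can also be set explicitly:
    env.enableCheckpointing(60000) // example: checkpoint every 60 s
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
      3,                              // example: number of restart attempts
      Time.of(10, TimeUnit.SECONDS))) // example: delay between attempts

    env.fromElements(1, 2, 3).print() // trivial pipeline so the job has something to run
    env.execute("restart-strategy-example")
  }
}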

On Fri, Mar 1, 2019 at 7:27 AM 孙森 <[hidden email]> wrote:
Hi Gary:
             I checked the znode; the address of the leader was there.



When I removed the ZooKeeper configuration in the client's flink-conf.yaml, the job was submitted successfully.
Then I tried to test whether HA could work. I killed the job manager and it restarted, but the job did not restart when the job manager restarted.


Best!
Sen

On Feb 28, 2019, at 6:59 PM, Gary Yao <[hidden email]> wrote:

Hi Sen,

I took a look at the CLI code again, and found out that -m is ignored if high-
availability: ZOOKEEPER is configured in your flink-conf.yaml. This does not
seem right and should be at least documented [1].

Judging from the client logs that you provided, I think the problem is that
the client cannot resolve the leading JobManager from ZooKeeper [2][3]. You
can try the following things for debugging:

    * Check the contents in the znode
      /flink/[...]/leader/rest_server_lock using the ZK CLI. It should contain the
      address of the leader. If not, I would check the jobmanager logs for related
      errors.
   
    * Submit the job with -m parameter but without ZooKeeper configuration in
      the client's flink-conf.yaml

Best,
Gary 

[1] https://issues.apache.org/jira/browse/FLINK-11779
[2] https://github.com/apache/flink/blob/release-1.5.1/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L170
[3] https://github.com/apache/flink/blob/release-1.5.1/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L746-L750

On Thu, Feb 28, 2019 at 4:34 AM 孙森 <[hidden email]> wrote:
Hi, Gary

       Actually, I have several Flink clusters on Yarn, one for each project. Each project can only submit jobs to its specific cluster.
I’ve already enabled logging on DEBUG level.

How did you determine "jmhost" and "port"?
We do this by requesting the REST API: http://activeRm/proxy/appId/jars


The full client log is in the mail attachment.




On Feb 27, 2019, at 9:30 PM, Gary Yao <[hidden email]> wrote:

Hi,

How did you determine "jmhost" and "port"? Actually you do not need to specify
these manually. If the client is using the same configuration as your cluster,
the client will look up the leading JM from ZooKeeper.

If you have already tried omitting the "-m" parameter, you can check in the
client logs which host is used for the job submission [1]. Note that you need
to enable logging on DEBUG level.

The root cause in your stacktrace is a TimeoutException. I would debug this by
checking if you can establish a TCP connection – from the machine you are
submitting the job from, to the target host/port [2].

Moreover, you are using a quite dated Flink version. The newest version in the
1.5 major release is 1.5.6 – so consider upgrading to that or even to 1.7.

Best,
Gary

[1] https://github.com/apache/flink/blob/3488f8b144a2127497c39b8ed5a48a65b551c57d/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L185
[2] https://stackoverflow.com/questions/4922943/test-from-shell-script-if-remote-tcp-port-is-open

On Wed, Feb 27, 2019 at 8:09 AM 孙森 <[hidden email]> wrote:
Hi all:
        
        I run flink (1.5.1 with hadoop 2.7) on yarn ,and submit job by “/usr/local/flink/bin/flink run -m jmhost:port my.jar”, but the submission is failed.
The HA configuration is :
  •  high-availability: zookeeper
  •  high-availability.storageDir: hdfs:///flink/ha/
  •  high-availability.zookeeper.quorum:  hdp1:2181,hdp2:2181,hdp3:2181
  •  yarn.application-attempts: 2
The info showed int the client log:    

2019-02-27 11:48:38,651 INFO  org.apache.flink.runtime.rest.RestClient                      - Shutting down rest endpoint.
2019-02-27 11:48:38,659 INFO  org.apache.flink.runtime.rest.RestClient                      - Rest endpoint shutdown complete.
2019-02-27 11:48:38,662 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2019-02-27 11:48:38,665 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2019-02-27 11:48:38,670 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x2679c52880c00ee closed
2019-02-27 11:48:38,689 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x2679c52880c00ee
2019-02-27 11:48:38,690 ERROR org.apache.flink.client.cli.CliFrontend                       - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more

------------------------------------------------------------
 The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve the execution result.
	at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:257)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:464)
	at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
	at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkMainProcess.process(WormholeFlinkMainProcess.scala:114)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.delayedEndpoint$edp$wormhole$flinkx$eventflow$WormholeFlinkxStarter$1(WormholeFlinkxStarter.scala:40)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$delayedInit$body.apply(WormholeFlinkxStarter.scala:29)
	at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.App$$anonfun$main$1.apply(App.scala:76)
	at scala.collection.immutable.List.foreach(List.scala:381)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
	at scala.App$class.main(App.scala:76)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter$.main(WormholeFlinkxStarter.scala:29)
	at edp.wormhole.flinkx.eventflow.WormholeFlinkxStarter.main(WormholeFlinkxStarter.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:420)
	at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:404)
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:785)
	at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:279)
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:214)
	at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1025)
	at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
	at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:370)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:214)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:834)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
	at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
	at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
	at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
	... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
	... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
	... 10 more
Caused by: java.util.concurrent.TimeoutException
	... 8 more




Re: submit job failed on Yarn HA

sen
Hi Gary:


        Yes, I enabled checkpoints in my program.

On Mar 4, 2019, at 3:03 AM, Gary Yao <[hidden email]> wrote:

Hi Sen,

Did you set a restart strategy [1]? If you enabled checkpoints [2], the fixed-
delay strategy will be used by default.

Best,
Gary

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/restart_strategies.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/state/checkpoints.html
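For reference, a minimal Scala sketch of the advice above, assuming the Flink 1.5 streaming Scala API; the checkpoint interval, attempt count and delay are placeholder values, not something taken from this job:

    import java.util.concurrent.TimeUnit
    import org.apache.flink.api.common.restartstrategy.RestartStrategies
    import org.apache.flink.api.common.time.Time
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // With checkpointing enabled, Flink falls back to the fixed-delay restart
    // strategy even if none is configured explicitly; 60s interval is a placeholder.
    env.enableCheckpointing(60000L)

    // Setting the strategy explicitly makes the restart behaviour visible in the code;
    // 3 attempts with a 10s delay are placeholder values.
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)))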

On Fri, Mar 1, 2019 at 7:27 AM 孙森 <[hidden email]> wrote:
Hi Gary:
             I checked the znode; the leader's address was there.

<Screenshot 2019-03-01 10.45.45 AM 1.png>

When I removed the ZooKeeper configuration in the client's flink-conf.yaml, the job was submitted successfully.
Then I tried to test whether HA worked: I killed the JobManager and it restarted, but the job did not restart when the JobManager came back.


Best!
Sen

On Feb 28, 2019, at 6:59 PM, Gary Yao <[hidden email]> wrote:

Hi Sen,

I took a look at the CLI code again, and found out that -m is ignored if high-
availability: ZOOKEEPER is configured in your flink-conf.yaml. This does not
seem right and should be at least documented [1].

Judging from the client logs that you provided, I think the problem is that
the client cannot resolve the leading JobManager from ZooKeeper [2][3]. You
can try the following things for debugging:

    * Check the contents in the znode
      /flink/[...]/leader/rest_server_lock using the ZK CLI. It should contain the
      address of the leader. If not, I would check the jobmanager logs for related
      errors. (A rough client-side sketch of this check follows after this message.)
   
    * Submit the job with -m parameter but without ZooKeeper configuration in
      the client's flink-conf.yaml

Best,
Gary 

[1] https://issues.apache.org/jira/browse/FLINK-11779
[2] https://github.com/apache/flink/blob/release-1.5.1/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L170
[3] https://github.com/apache/flink/blob/release-1.5.1/flink-clients/src/main/java/org/apache/flink/client/program/rest/RestClusterClient.java#L746-L750
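If the ZK CLI is not at hand, roughly the same check can be scripted. The sketch below is only an illustration: it assumes the plain org.apache.zookeeper client is on the classpath, and the cluster-specific part of the znode path has to be filled in by hand. The payload is written with Java serialization (leader address plus session id), so printing it as a string only gives a rough look at the address:

    import java.util.concurrent.CountDownLatch
    import org.apache.zookeeper.Watcher.Event.KeeperState
    import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

    // Quorum from this thread; replace "..." with the cluster-specific path segment.
    val quorum = "hdp1:2181,hdp2:2181,hdp3:2181"
    val path = "/flink/.../leader/rest_server_lock"

    // Wait until the ZooKeeper client is actually connected before reading.
    val connected = new CountDownLatch(1)
    val zk = new ZooKeeper(quorum, 30000, new Watcher {
      override def process(event: WatchedEvent): Unit =
        if (event.getState == KeeperState.SyncConnected) connected.countDown()
    })
    connected.await()

    val data = zk.getData(path, false, null)   // throws NoNodeException if the znode is missing
    println(new String(data, "UTF-8"))         // the leader address is readable inside the bytes
    zk.close()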

On Thu, Feb 28, 2019 at 4:34 AM 孙森 <[hidden email]> wrote:
Hi Gary,

       Actually, I have several Flink clusters on Yarn, one for each project. Each project can only submit jobs to its own cluster.
I’ve already enabled logging on DEBUG level.

How did you determine "jmhost" and "port"?
We do this by requesting the REST API: http://activeRm/proxy/appId/jars
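A bare-bones Scala sketch of that lookup, keeping "activeRm" and "appId" exactly as the placeholders used above:

    import scala.io.Source

    // Assumes the YARN web proxy forwards /jars to the Flink REST endpoint of the
    // application, as described above; "activeRm" and "appId" are placeholders.
    val url = "http://activeRm/proxy/appId/jars"
    val jarsJson = Source.fromURL(url).mkString
    println(jarsJson)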


The full client log is in the mail attachment.




On Feb 27, 2019, at 9:30 PM, Gary Yao <[hidden email]> wrote:

Hi,

How did you determine "jmhost" and "port"? Actually you do not need to specify
these manually. If the client is using the same configuration as your cluster,
the client will look up the leading JM from ZooKeeper.

If you have already tried omitting the "-m" parameter, you can check in the
client logs which host is used for the job submission [1]. Note that you need
to enable logging on DEBUG level.

The root cause in your stacktrace is a TimeoutException. I would debug this by
checking if you can establish a TCP connection – from the machine you are
submitting the job from, to the target host/port [2].

Moreover, you are using a quite dated Flink version. The newest version in the
1.5 major release is 1.5.6 – so consider upgrading to that or even to 1.7.

Best,
Gary

[1] https://github.com/apache/flink/blob/3488f8b144a2127497c39b8ed5a48a65b551c57d/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L185
[2] https://stackoverflow.com/questions/4922943/test-from-shell-script-if-remote-tcp-port-is-open
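A small Scala sketch of that TCP check, assuming "jmhost" and the port are the same values passed to flink run -m; 8081 below is only the default REST port, not necessarily the right one here:

    import java.net.{InetSocketAddress, Socket}

    // Returns true if a TCP connection to host:port can be opened within the timeout.
    def canConnect(host: String, port: Int, timeoutMs: Int = 5000): Boolean = {
      val socket = new Socket()
      try { socket.connect(new InetSocketAddress(host, port), timeoutMs); true }
      catch { case _: java.io.IOException => false }
      finally socket.close()
    }

    println(canConnect("jmhost", 8081))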






Re: submit job failed on Yarn HA

Gary Yao-4
Hi Sen,

Are you using the default MemoryStateBackend [1]? As far as I know, it does
not support JobManager failover. If you are already using FsStateBackend or
RocksDBStateBackend, please send JM logs.

Best,
Gary

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/state/state_backends.html#available-state-backends
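For reference, a minimal Scala sketch of configuring FsStateBackend in a job; the HDFS path is a placeholder:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend
    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Checkpoints and state go to a durable location that a restarted JobManager
    // can read back; "hdfs:///flink/checkpoints" is a placeholder path.
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))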







Re: submit job failed on Yarn HA

sen
Hi Gary:
         I used FsStateBackend.



The JM log is in the mail attachment (jobmanager.log).



After the restart, the log is in the attachment (jobmanager-restart.log).





Best!
Sen








Screenshot 2019-03-05 11.35.50 AM.png (199K) Download Attachment
jobmanager.log (174K) Download Attachment
jobmanager-restart.log (492K) Download Attachment

Re: submit job failed on Yarn HA

Gary Yao-4
Hi Sen,

I don't see

    high-availability: zookeeper

in your Flink configuration. However, this is mandatory for an HA setup. By
default "none" is used, and the ZK configuration is ignored. The log also
hints that you are using StandaloneLeaderElectionService instead of the
ZooKeeper implementation (note that the leaderSessionID consists only of 0s
[1][2]):

    2019-03-05 11:23:53,883 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://hdp3:60179 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000

Did you accidentally delete the "high-availability" config from your flink-
conf.yaml?

You probably also want to increase the number of yarn.application-attempts
[3].

Best,
Gary


[1] https://github.com/apache/flink/blob/dcd8c74b504046802cebf278b718e4753928a260/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/StandaloneLeaderElectionService.java#L48
[2] https://github.com/apache/flink/blob/dcd8c74b504046802cebf278b718e4753928a260/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/HighAvailabilityServices.java#L52-L57
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/deployment/yarn_setup.html#recovery-behavior-of-flink-on-yarn
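For reference, a minimal HA section of the cluster-side flink-conf.yaml along the lines Gary describes, reusing the values Sen posted at the start of this thread (the attempt count of 10 is only an illustrative, larger value, not something taken from Sen's setup):

    high-availability: zookeeper
    high-availability.storageDir: hdfs:///flink/ha/
    high-availability.zookeeper.quorum: hdp1:2181,hdp2:2181,hdp3:2181
    yarn.application-attempts: 10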



Re: submit job failed on Yarn HA

sen
Hi Gary:

          The ZooKeeper configuration is missing because the job submission fails whenever it is enabled.



Best
Sen



Re: submit job failed on Yarn HA

Gary Yao-4
Hi Sen,

In that email I meant that you should disable the ZooKeeper configuration in
the CLI because the CLI had trouble resolving the leader from ZooKeeper. What
you should have done is:

    1. Start the cluster normally with ZooKeeper enabled
    2. Edit the client's flink-conf.yaml to remove the ZooKeeper config
    3. Submit the job to your cluster with the -m flag.

Best,
Gary
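A rough sketch of that workaround, reusing the submit command quoted earlier in this thread (jmhost:port is a placeholder for the JobManager REST address of the running YARN application):

    # cluster side: keep high-availability: zookeeper in the cluster's flink-conf.yaml
    # client side: comment out the high-availability* keys in the client's flink-conf.yaml
    /usr/local/flink/bin/flink run -m jmhost:port my.jar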




Re: submit job failed on Yarn HA

sen
Hi Gary:
           
          Thanks very much! I tried it the way you described, and it works. I hope the bug can be fixed as soon as possible.
Best!
Sen

在 2019年3月5日,下午3:15,Gary Yao <[hidden email]> 写道:

Hi Sen,

In that email I meant that you should disable the ZooKeeper configuration in
the CLI because the CLI had troubles resolving the leader from ZooKeeper. What
you should have done is:

    1. Start the cluster normally with ZooKeeper enabled
    2. Edit flink-conf.yaml to remove ZooKeeper config
    3. Submit the job to your cluster with -m flag.

Best,
Gary









Re: submit job failed on Yarn HA

Gary Yao-4
Hi Sen,

I took a look at your CLI logs again, and saw that it uses the "default" Flink
namespace in ZooKeeper:

    2019-02-28 11:18:05,255 INFO  org.apache.flink.runtime.util.ZooKeeperUtils                  - Using '/flink/default' as Zookeeper namespace.

However, since you are using YARN, the Flink namespace in ZooKeeper should
include the YARN applicationId. Normally, the CLI tries to resolve the
applicationId from a local "YARN properties" file [1], which is generated
after a session cluster has been started successfully (using Flink's
bin/yarn-session.sh) [2]. In your case that file does not exist, maybe because
it got deleted, or because the host from which you are submitting the job is
not the one on which the session cluster was started.
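
A quick way to check whether that file is present on the host you submit from
is sketched below; the default file name and temp-directory location shown here
are assumptions (they can be overridden via yarn.properties-file.location):

    # the CLI looks for a per-user properties file in the temp directory
    ls -l /tmp/.yarn-properties-$(whoami)
    # if it exists, it should record the applicationID of the running session cluster
    cat /tmp/.yarn-properties-$(whoami)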

If you submit the job with -yid <yourApplicationId>, or --yarnapplicationId
<yourApplicationId>, the CLI should use the correct namespace in ZooKeeper.
Just submit the job normally without removing the ZooKeeper configuration from
flink-conf.yaml, and without specifying host:port manually with the "-m"
option. Let me know if this works for you.
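
For example, a submission with an explicit application id might look like this
(the id below is only a placeholder; use the one that "yarn application -list"
reports for your session cluster):

    /usr/local/flink/bin/flink run -yid application_1551234567_0001 my.jar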

Best,
Gary

[1] https://github.com/apache/flink/blob/e2579e39602ab7d3e906a185353dd413aca58317/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L236

[2] https://github.com/apache/flink/blob/e2579e39602ab7d3e906a185353dd413aca58317/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L622-L625

[3] https://github.com/apache/flink/blob/e2579e39602ab7d3e906a185353dd413aca58317/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L603-L606








sen

Re: submit job failed on Yarn HA

sen
Hi Gary:
          Yes, it's the second case: the client host is different from the one on which the session cluster was started. I've tried submitting with "flink run -yid", and it really works.

Best!
Sen
