(DEPRECATED) Apache Flink User Mailing List archive.

Re: Cancel flink job occur exception

Posted by Gary Yao-2 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Cancel-flink-job-occur-exception-tp22816p22921.html

Hi all,

The question is being handled on the dev mailing list [1].

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Cancel-flink-job-occur-exception-td24056.html

On Tue, Sep 4, 2018 at 2:21 PM, rileyli(李瑞亮) <[hidden email]> wrote:

Hi all,

I submit a Flink job in yarn-cluster mode and cancel it with the savepoint option immediately after the job status changes to deployed. Sometimes I get this error:

org.apache.flink.util.FlinkException: Could not cancel job xxxx.
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
    at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
    ... 6 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    ... 1 more
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
    ... 16 more
Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
    ... 7 more

I checked the JobManager log and found no errors. The savepoint was saved correctly in HDFS. The YARN application status changed to FINISHED and the FinalStatus changed to KILLED.

I think this issue occurs because RestClusterClient cannot reach the JobManager address after the JobManager (AM) has shut down.
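
The suspected failure mode can be reproduced in isolation. The following is only a minimal sketch under stated assumptions, not Flink's actual code: CancelRetrySketch, retry, pollStoppedJobManager, describeCancel and RetriesExhaustedException are hypothetical names, and the stopped JobManager is simulated by a future that always fails with java.net.ConnectException. It shows how a client that keeps retrying a status request against an endpoint that is already gone ends up with a "retries exhausted" error wrapping the connection failure, even though the work on the server side may have finished.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Supplier;

public class CancelRetrySketch {

    /** Hypothetical stand-in for the RetryException seen in the stack trace. */
    static class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs the operation once and retries it up to `retries` more times on failure. */
    static <T> CompletableFuture<T> retry(Supplier<CompletableFuture<T>> operation, int retries) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(operation, retries, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> operation, int retriesLeft, CompletableFuture<T> result) {
        operation.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (retriesLeft > 0) {
                attempt(operation, retriesLeft - 1, result);
            } else {
                // Out of retries: surface the last failure as the cause.
                result.completeExceptionally(new RetriesExhaustedException(
                        "Could not complete the operation. Number of retries has been exhausted.",
                        error));
            }
        });
    }

    /** Simulated status request against a JobManager that is no longer listening (assumption). */
    static CompletableFuture<String> pollStoppedJobManager() {
        CompletableFuture<String> response = new CompletableFuture<>();
        response.completeExceptionally(new ConnectException("Connection refused"));
        return response;
    }

    /** Returns the poll result, or "<outer exception> <- <root cause>" if all attempts fail. */
    static String describeCancel(Supplier<CompletableFuture<String>> poll) {
        try {
            return retry(poll, 3).get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            Throwable root = cause;
            while (root.getCause() != null) {
                root = root.getCause();
            }
            return cause.getClass().getSimpleName() + " <- " + root.getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // prints "RetriesExhaustedException <- ConnectException"
        System.out.println(describeCancel(CancelRetrySketch::pollStoppedJobManager));
    }
}
```

If the hypothesis is right, the exception on the client would not by itself mean the savepoint failed, which matches the savepoint being present in HDFS.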

My Flink version is 1.5.3.

Could anyone help me resolve this issue? Thanks!

Best regards!