(DEPRECATED) Apache Flink User Mailing List archive.

Exception when canceling a Flink job

rileyli(李瑞亮)

Exception when canceling a Flink job

Hi all,

I submit a Flink job in yarn-cluster mode and cancel it with the savepoint option immediately after the job status changes to deployed. Sometimes I get this error:

org.apache.flink.util.FlinkException: Could not cancel job xxxx.
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
    at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
    at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
    ... 6 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
    at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    ... 1 more
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
    at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
    at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
    ... 16 more
Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
    ... 7 more

I checked the JobManager log and found no errors. The savepoint was saved correctly in HDFS. The YARN application state changed to FINISHED and its FinalStatus changed to KILLED.

I think this happens because RestClusterClient can no longer reach the JobManager address after the JobManager (AM) has shut down.
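
The suspected failure mode can be sketched in plain Java, independent of Flink. This is only an illustration under stated assumptions: `RetryExhaustedSketch`, `RetriesExhaustedException`, `retry` and `pollStoppedEndpoint` are hypothetical names, the stopped JobManager is simulated by a future that always fails with a `ConnectException`, and the delay between retries is left out. It is not Flink's `FutureUtils` code; it only shows how a client that keeps polling an endpoint that has already gone away ends up with a "retries exhausted" error wrapping a connection failure, similar in shape to the trace above.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryExhaustedSketch {

    /** Hypothetical stand-in for a "retries exhausted" error; not Flink's RetryException. */
    static final class RetriesExhaustedException extends Exception {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs the operation once and, on failure, up to retriesLeft more times (no delay). */
    static <T> CompletableFuture<T> retry(Supplier<CompletableFuture<T>> operation, int retriesLeft) {
        CompletableFuture<T> result = new CompletableFuture<>();
        operation.get().whenComplete((value, failure) -> {
            if (failure == null) {
                result.complete(value);
            } else if (retriesLeft > 0) {
                // Try again and forward whatever the next attempt produces.
                retry(operation, retriesLeft - 1).whenComplete((v, f) -> {
                    if (f == null) {
                        result.complete(v);
                    } else {
                        result.completeExceptionally(f);
                    }
                });
            } else {
                // Out of retries: wrap the last failure, keeping it as the cause.
                result.completeExceptionally(new RetriesExhaustedException(
                        "Could not complete the operation. Number of retries has been exhausted.",
                        failure));
            }
        });
        return result;
    }

    /** Polls a simulated endpoint that is already down and returns the exception the caller sees. */
    static Throwable pollStoppedEndpoint(int retries, AtomicInteger attempts) {
        Supplier<CompletableFuture<String>> poll = () -> {
            attempts.incrementAndGet();
            CompletableFuture<String> response = new CompletableFuture<>();
            // Assumption: the server process has exited, so every connection attempt is refused.
            response.completeExceptionally(new ConnectException("Connection refused"));
            return response;
        };
        try {
            retry(poll, retries).get();
            return null; // not reached in this simulation
        } catch (ExecutionException e) {
            return e;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        }
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        Throwable error = pollStoppedEndpoint(3, attempts);
        System.out.println("attempts: " + attempts.get());
        // Walk the cause chain from the outermost exception to the root cause.
        for (Throwable t = error; t != null; t = t.getCause()) {
            System.out.println(t.getClass().getName() + ": " + t.getMessage());
        }
    }
}
```

In this sketch the savepoint itself plays no role: the operation could have succeeded on the server side, and the caller would still see a failure, because the error comes only from the polling side losing its endpoint.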

My Flink version is 1.5.3.

Could anyone help me resolve this issue? Thanks!

Best regards!

Gary Yao-2

Re: Exception when canceling a Flink job

Hi all,

The question is being handled on the dev mailing list [1].

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Cancel-flink-job-occur-exception-td24056.html

On Tue, Sep 4, 2018 at 2:21 PM, rileyli(李瑞亮) <[hidden email]> wrote:
