Cancel flink job occur exception

rileyli(李瑞亮)
Hi all,
      I submitted a Flink job in yarn-cluster mode and cancelled it with the savepoint option immediately after the job status changed to deployed. Sometimes I get this error:

org.apache.flink.util.FlinkException: Could not cancel job xxxx.
        at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
        at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
        at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
        at org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
        at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
        ... 6 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        ... 1 more
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
        at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
        ... 16 more
Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
        ... 7 more
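
When reading a nested trace like the one above, it can help to unwrap the "Caused by" chain programmatically and look only at the innermost exception. The following is a minimal sketch that uses JDK exception types only; `RootCauseDemo`, `rootCause`, and `sampleChain` are hypothetical names, and the plain `Exception` and `RuntimeException` objects are stand-ins for Flink's `FlinkException` and `FutureUtils$RetryException`, which are not on the classpath here.

```java
import java.net.ConnectException;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

class RootCauseDemo {
    /** Follows getCause() links until the innermost throwable is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop at the last link; also guard against a throwable that is its own cause.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    /** Builds a chain shaped like the trace above (stand-in types, placeholder messages). */
    static Exception sampleChain() {
        ConnectException refused = new ConnectException("Connection refused (placeholder address)");
        CompletionException completion = new CompletionException(refused);
        // Stand-in for FutureUtils$RetryException.
        RuntimeException retriesExhausted =
                new RuntimeException("Number of retries has been exhausted.", completion);
        ExecutionException execution = new ExecutionException(retriesExhausted);
        // Stand-in for FlinkException.
        return new Exception("Could not cancel job (placeholder id).", execution);
    }

    public static void main(String[] args) {
        Throwable root = rootCause(sampleChain());
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In this sample chain the innermost throwable is the `java.net.ConnectException`, which matches the last "Caused by" entry of the trace: the client could not open a connection, rather than the cancel request itself being rejected.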

    I checked the JobManager log and found no errors. The savepoint was saved correctly in HDFS. The YARN application state changed to FINISHED and its FinalStatus changed to KILLED.
    I think this issue occurs because RestClusterClient cannot reach the JobManager address after the JobManager (AM) has shut down.
    My Flink version is 1.5.3.
    Could anyone help me resolve this issue? Thanks!

Best regards!
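
The hypothesis in the post above (the client keeps contacting an endpoint that has already shut down) can be illustrated with a plain TCP probe. This is only a sketch under stated assumptions: it does not use any Flink API, `EndpointProbe` and its methods are hypothetical names, and a local `ServerSocket` stands in for the JobManager REST endpoint.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class EndpointProbe {
    /** Returns true if a TCP connection to host:port can be opened within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers java.net.ConnectException (connection refused) as well as timeouts.
            return false;
        }
    }

    /** Probes a local stand-in listener while it is open and again after it is closed. */
    static boolean[] probeBeforeAndAfterShutdown() {
        try {
            InetAddress loopback = InetAddress.getLoopbackAddress();
            ServerSocket listener = new ServerSocket(0, 1, loopback); // port 0 = any free port
            int port = listener.getLocalPort();
            boolean whileRunning = isReachable(loopback.getHostAddress(), port, 1000);
            listener.close(); // stands in for the JobManager (AM) shutting down
            boolean afterShutdown = isReachable(loopback.getHostAddress(), port, 1000);
            return new boolean[] {whileRunning, afterShutdown};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = probeBeforeAndAfterShutdown();
        System.out.println("reachable while running: " + result[0]);
        System.out.println("reachable after shutdown: " + result[1]);
    }
}
```

A probe like this, pointed at the real JobManager address right after the failed cancel, would help distinguish "the endpoint is gone because the cluster already shut down" from other network problems. It does not by itself confirm that the savepoint completed, so checking the savepoint directory remains necessary.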

Re: Cancel flink job occur exception

Gary Yao-2
Hi all,

The question is being handled on the dev mailing list [1].

Best,
Gary

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Cancel-flink-job-occur-exception-td24056.html

On Tue, Sep 4, 2018 at 2:21 PM, rileyli(李瑞亮) <[hidden email]> wrote: