TimeoutException in Flink 1.11 stop command

Diwakar Jha
Hello,

I'm trying to use the Flink 1.11 stop command to gracefully shut down an application with a savepoint.

flink stop --savepointPath s3a://path_to_save_point c5d52e0146258f80fd52a3bf002d2a1b  -yid application_1620673166934_0001

2021-05-11 06:26:57,852 ERROR org.apache.flink.client.cli.CliFrontend [] - Error while running the command.
org.apache.flink.util.FlinkException: Could not stop with a savepoint job "c5d52e0146258f80fd52a3bf002d2a1b".
    at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:495) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:864) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:487) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:931) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_252]
    at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_252]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) [hadoop-common-3.2.1-amzn-1.jar:?]
    at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) [flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992) [flink-dist_2.12-1.11.0.jar:1.11.0]
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_252]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_252]
    at org.apache.flink.client.cli.CliFrontend.lambda$stop$5(CliFrontend.java:493) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    ... 9 more

The cancel command seems to work fine.
Please let me know how to fix this TimeoutException.

Thanks.

Re: TimeoutException in Flink 1.11 stop command

Chesnay Schepler
Essentially this exception just means that the savepoint operation took longer than the CLI expected.

This can occur for a number of reasons; maybe everything is working as expected but the timeout is just too low (controlled via "client.timeout").
It could also be that the savepoint operation takes abnormally long; for example due to IO bottlenecks.

I suggest looking into the JobManager logs to see whether the savepoint was actually created and the application shut down; if so, simply increase the timeout.
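
To illustrate the mechanism: the stack trace shows the exception coming from a timed CompletableFuture.get() call in the CLI. The sketch below is not Flink's client code; the class name, method names, and savepoint path are all hypothetical. It only shows that a timed wait on a future can give up while the underlying operation keeps running and completes later, which is why checking the JobManager logs is worthwhile.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names come from Flink.
class ClientTimeoutSketch {

    // Stands in for a savepoint that needs `millis` milliseconds to finish.
    static CompletableFuture<String> slowSavepoint(long millis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "s3a://hypothetical-bucket/savepoint-1"; // made-up path
        });
    }

    // Waits at most `timeout` for the result. A timeout here only ends the
    // wait; it does not cancel or fail the operation behind the future.
    static String awaitResult(CompletableFuture<String> future, Duration timeout) {
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMED_OUT";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> savepoint = slowSavepoint(500);
        // The first wait is shorter than the operation, so it times out.
        System.out.println("short wait: " + awaitResult(savepoint, Duration.ofMillis(50)));
        // The same future still completes when given enough time.
        System.out.println("long wait:  " + awaitResult(savepoint, Duration.ofSeconds(5)));
    }
}
```

In the real CLI the length of that wait is what "client.timeout" controls, so raising it corresponds to the second, longer wait in the sketch.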

Re: TimeoutException in Flink 1.11 stop command

Diwakar Jha
Thanks.

I tried this command and it worked.
flink stop -p s3a://path_to_savepoint/savepoints 5f9241d336ea2c652a84f79ac3158597  -yid application_1620673166934_0001

I will also look at "client.timeout" to figure out what actually happened.

Thanks.