Cannot cancel job with savepoint due to timeout

Bruno Aranda, 2017-01-31 17:47 GMT+03:00
Hi there,

I am trying to cancel a job and create a savepoint (i.e. flink cancel -s), but it takes more than a minute and then fails due to the timeout. However, it seems that the job is cancelled successfully and the savepoint is created; I can only see that through the dashboard, though.

Cancelling job 790b60a2b44bc98854782d4e0cac05d5 with savepoint to default savepoint directory.

------------------------------------------------------------
 The program finished with the following exception:

java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at scala.concurrent.Await.result(package.scala)
at org.apache.flink.client.CliFrontend.cancel(CliFrontend.java:618)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1079)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117)
at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1117)
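The trace shows the CLI blocking in scala.concurrent.Await.result inside CliFrontend.cancel, so it is the client that gives up after 60 seconds, while the cancel-with-savepoint work presumably continues on the cluster side. That would explain why the job still ends up cancelled with a savepoint. The Java sketch below is only an analogy, not Flink's code: a bounded wait on a future (standing in for the cancel-with-savepoint request) times out, yet the underlying operation still completes. The 100 ms and 300 ms durations are arbitrary.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientTimeoutDemo {
    // Waits at most timeoutMs for the operation. Timing out here only stops the
    // *waiting*; it does not cancel the operation itself.
    static String waitForResult(CompletableFuture<String> operation, long timeoutMs) {
        try {
            return operation.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "client timed out";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (ExecutionException e) {
            return "failed: " + e.getCause();
        }
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the savepoint: takes longer than the client is willing to wait.
        CompletableFuture<String> savepoint = CompletableFuture.supplyAsync(() -> {
            sleep(300);
            return "savepoint completed";
        });
        System.out.println(waitForResult(savepoint, 100)); // client timed out
        System.out.println(savepoint.join());             // savepoint completed
    }
}
```

A script that only looks at the client's exit status would therefore see a failure even though the operation succeeded, which is the situation described above.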

Is there any way to configure this timeout, so that we can rely on the outcome of this command in scripts, etc.?

Thanks!

Bruno

Re: Cannot cancel job with savepoint due to timeout

Yury Ruchin, Wed, 1 Feb 2017 at 07:40
Hi Bruno,

From the code, I conclude that the "akka.client.timeout" setting is what controls this. It defaults to 60 seconds.
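The 60-second default lines up with the "Futures timed out after [60000 milliseconds]" message in the trace. The Java sketch below is a hypothetical helper, not Flink's own parser: it looks up akka.client.timeout in flink-conf.yaml-style "key: value" lines, falls back to "60 s" when the key is absent, and converts the value to milliseconds. It only understands the units ms, s and min; as far as I know, Flink itself parses the value as a Scala Duration, which accepts more forms.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ClientTimeoutLookup {
    // Default stated in the thread: 60 seconds.
    static final String DEFAULT_CLIENT_TIMEOUT = "60 s";

    // Returns the raw value of `key` from "key: value" lines, or defaultValue if absent.
    // Blank lines and lines starting with '#' are skipped. Not a full YAML parser.
    static String lookup(List<String> confLines, String key, String defaultValue) {
        for (String line : confLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (colon > 0 && trimmed.substring(0, colon).trim().equals(key)) {
                return trimmed.substring(colon + 1).trim();
            }
        }
        return defaultValue;
    }

    // Simplified duration parser for "<number> <unit>" with unit ms, s or min only.
    static long toMillis(String value) {
        String[] parts = value.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected '<number> <unit>': " + value);
        }
        long amount = Long.parseLong(parts[0]);
        switch (parts[1]) {
            case "ms":
                return amount;
            case "s":
                return TimeUnit.SECONDS.toMillis(amount);
            case "min":
                return TimeUnit.MINUTES.toMillis(amount);
            default:
                throw new IllegalArgumentException("unsupported unit: " + parts[1]);
        }
    }

    public static void main(String[] args) {
        List<String> conf = Arrays.asList("jobmanager.rpc.address: localhost");
        String timeout = lookup(conf, "akka.client.timeout", DEFAULT_CLIENT_TIMEOUT);
        System.out.println(timeout + " = " + toMillis(timeout) + " ms"); // 60 s = 60000 ms
    }
}
```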

I'm not sure why this setting, like many other "akka.*" settings, is not documented, though; maybe there are good reasons for that.

Regards,
Yury

Re: Cannot cancel job with savepoint due to timeout

elmosca (Bruno Aranda), Wed, Feb 1, 2017 at 9:47 AM
Maybe. It would be good to be able to override it on the command line somehow, but I guess I could just change the Flink config.

Many thanks, Yury,

Bruno

Re: Cannot cancel job with savepoint due to timeout

Till Rohrmann
Hi Bruno,

The missing documentation for akka.client.timeout is an oversight on our part [1]. I'll update it as soon as possible.

Unfortunately, at the moment the only way to change it is to set akka.client.timeout in the flink-conf.yaml file.
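Since flink-conf.yaml is the only place to set it, a script could patch that file before running flink cancel -s. The Java sketch below is a hypothetical helper, not part of Flink: it treats the file as simple "key: value" lines (not a full YAML parser) and either replaces an existing uncommented akka.client.timeout entry or appends one. The value "600 s" is an arbitrary example, and the file path passed as an argument is an assumption about your installation layout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SetClientTimeout {
    // Returns a copy of "key: value" lines with `key` set to `value`.
    // The first uncommented entry for `key` is replaced, later duplicates are dropped,
    // and the entry is appended if missing. Commented-out lines are kept unchanged.
    static List<String> withSetting(List<String> lines, String key, String value) {
        List<String> out = new ArrayList<>();
        boolean written = false;
        for (String line : lines) {
            String trimmed = line.trim();
            int colon = trimmed.indexOf(':');
            boolean isKey = !trimmed.startsWith("#") && colon > 0
                    && trimmed.substring(0, colon).trim().equals(key);
            if (!isKey) {
                out.add(line);
            } else if (!written) {
                out.add(key + ": " + value);
                written = true;
            }
        }
        if (!written) {
            out.add(key + ": " + value);
        }
        return out;
    }

    // Rewrites the given file in place (hypothetical usage: pass the path of your flink-conf.yaml).
    static void updateFile(Path conf, String key, String value) throws IOException {
        List<String> lines = Files.readAllLines(conf, StandardCharsets.UTF_8);
        Files.write(conf, withSetting(lines, key, value), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            updateFile(Paths.get(args[0]), "akka.client.timeout", "600 s");
        } else {
            List<String> demo = Arrays.asList("jobmanager.rpc.address: localhost");
            // prints [jobmanager.rpc.address: localhost, akka.client.timeout: 600 s]
            System.out.println(withSetting(demo, "akka.client.timeout", "600 s"));
        }
    }
}
```

Replacing rather than blindly appending keeps the file with a single effective entry, so repeated script runs do not pile up duplicate keys.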


Cheers,
Till