(DEPRECATED) Apache Flink User Mailing List archive.

Could not cancel job (with savepoint) "Ask timed out"

Juho Autio

Could not cancel job (with savepoint) "Ask timed out"

I was trying to cancel a job with savepoint, but the CLI command failed with "akka.pattern.AskTimeoutException: Ask timed out".

The stack trace reveals that the ask timeout is 10 seconds:

Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/jobmanager_0#106635280]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

Indeed, the documentation states that the default value of akka.ask.timeout is "10 s":

https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Behind the scenes, the savepoint creation and job cancellation succeeded, which was more or less to be expected. So my problem is just getting a proper response back from the CLI call instead of having it time out so eagerly.
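The behaviour described here, where the caller gives up while the operation itself still finishes, can be illustrated with a small self-contained sketch. It does not use Flink or Akka at all: slowOperation is a hypothetical stand-in for the cluster-side work, and the durations and the returned string are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class AskTimeoutSketch {

    /** Hypothetical stand-in for a slow cluster-side operation such as cancel-with-savepoint. */
    static CompletableFuture<String> slowOperation(long durationMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(durationMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "savepoint-path"; // placeholder result, not a real savepoint location
        });
    }

    /** Waits at most timeoutMillis; returns true if the caller gave up before the operation finished. */
    static boolean timedOut(CompletableFuture<String> operation, long timeoutMillis) {
        try {
            operation.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            // Only the wait was abandoned; the operation itself keeps running.
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> operation = slowOperation(500);
        System.out.println("caller timed out: " + timedOut(operation, 50));
        // Waiting without a deadline shows that the operation still completes.
        System.out.println("operation result: " + operation.join());
    }
}
```

The point of the sketch is only that a timeout on the waiting side says nothing about the outcome on the other side, which is why the job can end up cancelled even though the CLI reported a failure.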

To be exact, what I ran was:

flink-1.5.2/bin/flink cancel b7c7d19d25e16a952d3afa32841024e5 -m yarn-cluster -yid application_1533676784032_0001 --withSavepoint

Should I change akka.ask.timeout to a longer value? If yes, can I override it just for the CLI call somehow? Could it have undesired side effects if set globally for the actual Flink jobs to use?

What about akka.client.timeout? Its default is also rather low: "60 s". Should it be increased accordingly if I want to allow more than 60 s for savepoint creation?

Finally, the default timeout is so low that I would expect this to be a common problem. I would say that the Flink CLI should have a higher default timeout for cancel and savepoint creation operations.

Thanks!

vino yang

Re: Could not cancel job (with savepoint) "Ask timed out"

Hi Juho,

This problem does exist. As a temporary workaround, I suggest you separate the two steps:

1) Trigger the savepoint separately;

2) Execute the cancel command.

Hi Till, Chesnay:

Our internal environment and multiple users on the mailing list have encountered similar problems.

In our environment, it seems that the JobManager (JM) reports that the savepoint is complete and has stopped itself, but the client still tries to connect to the old JM and reports a timeout exception.

Thanks, vino.

On Wed, Aug 8, 2018 at 9:18 PM Juho Autio <[hidden email]> wrote:


Juho Autio

Re: Could not cancel job (with savepoint) "Ask timed out"

Thanks for the suggestion. Is the separate savepoint triggering async? Would you then separately poll for the savepoint's completion before executing cancel? If additional polling is needed, then I would say that for my purpose it's still easier to call cancel with savepoint and simply ignore the result of the call. I would assume that it won't do any harm if I keep retrying cancel with savepoint until the job stops – I expect that an overlapping cancel request is ignored if the job is already creating a savepoint. Please correct me if my assumption is wrong.

On Thu, Aug 9, 2018 at 5:04 AM vino yang <[hidden email]> wrote:


vino yang

Re: Could not cancel job (with savepoint) "Ask timed out"

Hi Juho,

We use the REST client API triggerSavepoint(). It returns a CompletableFuture, and we then call its get() method.

In other words, we wait synchronously for the savepoint to complete.

cancelWithSavepoint does essentially the same thing: it waits synchronously for the savepoint to complete and then executes the cancel command.
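The wait-then-cancel sequence described here can be sketched roughly as follows. This is a minimal illustration rather than Flink's actual client code: the JobClient interface, its method signatures, the RecordingClient fake, and the job ID and directory values are all hypothetical stand-ins chosen so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SavepointThenCancel {

    /** Hypothetical client interface; the names are illustrative, not Flink's real API. */
    interface JobClient {
        CompletableFuture<String> triggerSavepoint(String jobId, String targetDirectory);
        CompletableFuture<Void> cancel(String jobId);
    }

    /** Blocks on a future for a bounded time and converts checked failures to unchecked ones. */
    private static <T> T await(CompletableFuture<T> future, long timeoutSeconds) {
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("operation failed or timed out", e);
        }
    }

    /** Triggers a savepoint, waits for it to complete, and only then cancels the job. */
    static String savepointThenCancel(JobClient client, String jobId,
                                      String targetDirectory, long timeoutSeconds) {
        String savepointPath = await(client.triggerSavepoint(jobId, targetDirectory), timeoutSeconds);
        // The cancel is issued only after the savepoint future has completed successfully.
        await(client.cancel(jobId), timeoutSeconds);
        return savepointPath;
    }

    /** In-memory fake that records the order of calls; used only for demonstration. */
    static final class RecordingClient implements JobClient {
        final List<String> calls = new ArrayList<>();

        @Override
        public CompletableFuture<String> triggerSavepoint(String jobId, String targetDirectory) {
            calls.add("savepoint:" + jobId);
            return CompletableFuture.completedFuture(targetDirectory + "/savepoint-" + jobId);
        }

        @Override
        public CompletableFuture<Void> cancel(String jobId) {
            calls.add("cancel:" + jobId);
            return CompletableFuture.completedFuture(null);
        }
    }

    public static void main(String[] args) {
        RecordingClient client = new RecordingClient();
        String path = savepointThenCancel(client, "job-1", "/tmp/savepoints", 60);
        System.out.println("savepoint written to " + path);
        System.out.println("calls in order: " + client.calls);
    }
}
```

Because the timeout is passed in by the caller, it can be chosen to match how long a savepoint is expected to take instead of relying on a fixed default.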

We do not use the CLI. Since you are going through the CLI, you can check whether the savepoint is complete by looking at the logs or the web UI.

Thanks, vino.

On Thu, Aug 9, 2018 at 3:07 PM Juho Autio <[hidden email]> wrote:


Till Rohrmann

Re: Could not cancel job (with savepoint) "Ask timed out"

Just a small addition. A concurrent cancel call will interfere with the cancel-with-savepoint command and directly cancel the job. So it is better to use the cancel-with-savepoint call in order to take a savepoint and then cancel the job automatically.

Cheers,

Till

On Thu, Aug 9, 2018 at 9:53 AM vino yang <[hidden email]> wrote:


Juho Autio

Re: Could not cancel job (with savepoint) "Ask timed out"

What I meant to ask was, does it do any harm to keep calling cancel-with-savepoint until the job exits? If the job is already cancelling with savepoint, I would assume that another cancel-with-savepoint call is just ignored.

On Tue, Aug 21, 2018 at 1:18 PM Till Rohrmann <[hidden email]> wrote:


Till Rohrmann

Re: Could not cancel job (with savepoint) "Ask timed out"

Calling cancel-with-savepoint multiple times will trigger multiple savepoints. The first issued savepoint will complete first and then cancel the job. Thus, the later savepoints might or might not complete, depending on timing. Since a savepoint can flush results to external systems, I would recommend not calling the API multiple times.
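One way a caller could avoid issuing the request more than once is a small client-side guard that hands back the same pending future on repeat calls. The sketch below is only an assumption about how such a guard might look: SingleFlightCancel and the function passed to it are hypothetical, and nothing here is part of Flink.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class SingleFlightCancel {

    // One future per job ID; later callers reuse it instead of sending a new request.
    private final ConcurrentMap<String, CompletableFuture<String>> requests = new ConcurrentHashMap<>();

    // Hypothetical function that actually sends cancel-with-savepoint for a job ID.
    private final Function<String, CompletableFuture<String>> cancelWithSavepoint;

    SingleFlightCancel(Function<String, CompletableFuture<String>> cancelWithSavepoint) {
        this.cancelWithSavepoint = cancelWithSavepoint;
    }

    /**
     * Sends the request the first time a job ID is seen and returns the same
     * future on every later call. A failed future also stays cached, so a
     * deliberate retry would need to remove the entry first.
     */
    CompletableFuture<String> request(String jobId) {
        return requests.computeIfAbsent(jobId, cancelWithSavepoint);
    }

    public static void main(String[] args) {
        AtomicInteger sent = new AtomicInteger();
        SingleFlightCancel guard = new SingleFlightCancel(jobId -> {
            sent.incrementAndGet();
            return new CompletableFuture<>(); // stays pending, like a slow savepoint
        });

        CompletableFuture<String> first = guard.request("job-1");
        CompletableFuture<String> second = guard.request("job-1");

        System.out.println("same future returned: " + (first == second));
        System.out.println("requests actually sent: " + sent.get());
    }
}
```

With this shape, a caller that hit a timeout can keep waiting on the existing future rather than triggering another savepoint.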

Cheers,

Till

On Wed, Aug 22, 2018 at 10:40 AM Juho Autio <[hidden email]> wrote:


Juho Autio

Re: Could not cancel job (with savepoint) "Ask timed out"

I see, thanks. Looks like it's better for us to switch to triggering the savepoint and the cancel separately.

On Wed, Aug 22, 2018 at 1:26 PM Till Rohrmann <[hidden email]> wrote:
