(DEPRECATED) Apache Flink User Mailing List archive.

Graceful Task Manager Termination and Replacement

Classic

List

Threaded

7 messages Options

Aaron Levin

Graceful Task Manager Termination and Replacement

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin

Hao Sun

Re: Graceful Task Manager Termination and Replacement

I have a common interest in this topic. My k8s recycle hosts, and I am facing the same issue. Flink can tolerate this situation, but I am wondering if I can do better

On Thu, Jul 11, 2019, 12:39 Aaron Levin <[hidden email]> wrote:

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin

Paul Lam

Re: Graceful Task Manager Termination and Replacement

In reply to this post by Aaron Levin

Hi,

Maybe region restart strategy can help. It restarts minimum required tasks. Note that it’s recommended to use only after 1.9 release, see [1], unless you’re running a stateless job.

[1] https://issues.apache.org/jira/browse/FLINK-10712

Best,

Paul Lam

在 2019年7月12日，03:38，Aaron Levin <[hidden email]> 写道：

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin

Biao Liu

Re: Graceful Task Manager Termination and Replacement

Hi Aaron,

From my understanding, you want shutting down a Task Manager without restart the job which has tasks running on this Task Manager?

Based on current implementation, if there is a Task Manager is down, the tasks on it would be treated as failed. The behavior of task failure is defined via `FailoverStrategy` which is `RestartAllStrategy` by default.

That's the reason why the whole job restarts when a Task Manager has gone. As Paul said, you could try "region restart failover strategy" when 1.9 is released. It might be helpful however it depends on your job topology.

The deeper reason of this issue is the consistency semantics of Flink, AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there is no much choice of `FailoverStrategy`.

It might be improved in the future. There are some discussions in the mailing list that providing some weaker consistency semantics to improve the `FailoverStrategy`. We are pushing forward this improvement. I hope it can be included in 1.10.

Regarding your question, I guess the answer is no for now. A more frequent checkpoint or a savepoint manually triggered might be helpful by a quicker recovery.

Paul Lam <[hidden email]> 于2019年7月12日周五上午10:25写道：

Hi,

Maybe region restart strategy can help. It restarts minimum required tasks. Note that it’s recommended to use only after 1.9 release, see [1], unless you’re running a stateless job.

[1] https://issues.apache.org/jira/browse/FLINK-10712

Best,
Paul Lam

在 2019年7月12日，03:38，Aaron Levin <[hidden email]> 写道：

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin

Aaron Levin

Re: Graceful Task Manager Termination and Replacement

I was on vacation but wanted to thank Biao for summarizing the current state! Thanks!

On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <[hidden email]> wrote:

Hi Aaron,

From my understanding, you want shutting down a Task Manager without restart the job which has tasks running on this Task Manager?

Based on current implementation, if there is a Task Manager is down, the tasks on it would be treated as failed. The behavior of task failure is defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
That's the reason why the whole job restarts when a Task Manager has gone. As Paul said, you could try "region restart failover strategy" when 1.9 is released. It might be helpful however it depends on your job topology.

The deeper reason of this issue is the consistency semantics of Flink, AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there is no much choice of `FailoverStrategy`.
It might be improved in the future. There are some discussions in the mailing list that providing some weaker consistency semantics to improve the `FailoverStrategy`. We are pushing forward this improvement. I hope it can be included in 1.10.

Regarding your question, I guess the answer is no for now. A more frequent checkpoint or a savepoint manually triggered might be helpful by a quicker recovery.

Paul Lam <[hidden email]> 于2019年7月12日周五上午10:25写道：
Hi,

Maybe region restart strategy can help. It restarts minimum required tasks. Note that it’s recommended to use only after 1.9 release, see [1], unless you’re running a stateless job.

[1] https://issues.apache.org/jira/browse/FLINK-10712

Best,
Paul Lam

在 2019年7月12日，03:38，Aaron Levin <[hidden email]> 写道：

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin

Yu Li

Re: Graceful Task Manager Termination and Replacement

Belated but FWIW, besides the region failover and best-efforts failover efforts, I believe stop with checkpoint as proposed in FLINK-12619 and FLIP-45 could also help here, FYI.

W.r.t k8s, there're also some offline discussion about supporting local recovery with persistent volume even when task assigned to other TMs during job failover.

[1] https://issues.apache.org/jira/browse/FLINK-12619

[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

Best Regards,

On Wed, 24 Jul 2019 at 17:00, Aaron Levin <[hidden email]> wrote:

I was on vacation but wanted to thank Biao for summarizing the current state! Thanks!

On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <[hidden email]> wrote:
Hi Aaron,

From my understanding, you want shutting down a Task Manager without restart the job which has tasks running on this Task Manager?

Based on current implementation, if there is a Task Manager is down, the tasks on it would be treated as failed. The behavior of task failure is defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
That's the reason why the whole job restarts when a Task Manager has gone. As Paul said, you could try "region restart failover strategy" when 1.9 is released. It might be helpful however it depends on your job topology.

The deeper reason of this issue is the consistency semantics of Flink, AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there is no much choice of `FailoverStrategy`.
It might be improved in the future. There are some discussions in the mailing list that providing some weaker consistency semantics to improve the `FailoverStrategy`. We are pushing forward this improvement. I hope it can be included in 1.10.

Regarding your question, I guess the answer is no for now. A more frequent checkpoint or a savepoint manually triggered might be helpful by a quicker recovery.

Paul Lam <[hidden email]> 于2019年7月12日周五上午10:25写道：
Hi,

Maybe region restart strategy can help. It restarts minimum required tasks. Note that it’s recommended to use only after 1.9 release, see [1], unless you’re running a stateless job.

[1] https://issues.apache.org/jira/browse/FLINK-10712

Best,
Paul Lam

在 2019年7月12日，03:38，Aaron Levin <[hidden email]> 写道：

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin

Biao Liu

Re: Graceful Task Manager Termination and Replacement

Hi Yu,

That's a great proposal. Wish to see this feature soon!

On Mon, Jul 29, 2019 at 4:59 PM Yu Li <[hidden email]> wrote:

Belated but FWIW, besides the region failover and best-efforts failover efforts, I believe stop with checkpoint as proposed in FLINK-12619 and FLIP-45 could also help here, FYI.

W.r.t k8s, there're also some offline discussion about supporting local recovery with persistent volume even when task assigned to other TMs during job failover.

[1] https://issues.apache.org/jira/browse/FLINK-12619
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-45%3A+Reinforce+Job+Stop+Semantic

Best Regards,

Yu

On Wed, 24 Jul 2019 at 17:00, Aaron Levin <[hidden email]> wrote:
I was on vacation but wanted to thank Biao for summarizing the current state! Thanks!

On Mon, Jul 15, 2019 at 2:00 AM Biao Liu <[hidden email]> wrote:
Hi Aaron,

From my understanding, you want shutting down a Task Manager without restart the job which has tasks running on this Task Manager?

Based on current implementation, if there is a Task Manager is down, the tasks on it would be treated as failed. The behavior of task failure is defined via `FailoverStrategy` which is `RestartAllStrategy` by default.
That's the reason why the whole job restarts when a Task Manager has gone. As Paul said, you could try "region restart failover strategy" when 1.9 is released. It might be helpful however it depends on your job topology.

The deeper reason of this issue is the consistency semantics of Flink, AT_LEAST_ONCE or EXACTLY_ONCE. Flink must respect these semantics. So there is no much choice of `FailoverStrategy`.
It might be improved in the future. There are some discussions in the mailing list that providing some weaker consistency semantics to improve the `FailoverStrategy`. We are pushing forward this improvement. I hope it can be included in 1.10.

Regarding your question, I guess the answer is no for now. A more frequent checkpoint or a savepoint manually triggered might be helpful by a quicker recovery.

Paul Lam <[hidden email]> 于2019年7月12日周五上午10:25写道：
Hi,

Maybe region restart strategy can help. It restarts minimum required tasks. Note that it’s recommended to use only after 1.9 release, see [1], unless you’re running a stateless job.

[1] https://issues.apache.org/jira/browse/FLINK-10712

Best,
Paul Lam

在 2019年7月12日，03:38，Aaron Levin <[hidden email]> 写道：

Hello,

Is there a way to gracefully terminate a Task Manager beyond just killing it (this seems to be what `./taskmanager.sh stop` does)? Specifically I'm interested in a way to replace a Task Manager that has currently-running tasks. It would be great if it was possible to terminate a Task Manager without restarting the job, though I'm not sure if this is possible.

Context: at my work we regularly cycle our hosts for maintenance and security. Each time we do this we stop the task manager running on the host being cycled. This causes the entire job to restart, resulting in downtime for the job. I'd love to decrease this downtime if at all possible.

Thanks! Any insight is appreciated!

Best,

Aaron Levin