(DEPRECATED) Apache Flink User Mailing List archive.

Task Manager recovery in Standalone Cluster High Availability mode

F.Amara

Task Manager recovery in Standalone Cluster High Availability mode

Hi,

I'm working with Apache Flink 1.1.2 and testing High Availability mode. In the case of a Task Manager failure, a standby TM is supposed to recover the work of the failed TM. In my case, I have 4 TMs running in parallel, and when one TM is killed the job state goes to Cancelling and then to Failed rather than Restarting, and the work is not recovered.

Is there a specific way to create standby TMs, and is there a specific reason why jobs are not being recovered?

Ufuk Celebi

Re: Task Manager recovery in Standalone Cluster High Availability mode

Hey! Did you configure a restart strategy?
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/restart_strategies.html

Keep in mind that in standalone mode a TM process that has exited
won't be automatically restarted, though.

On Tue, Feb 21, 2017 at 10:00 AM, F.Amara <[hidden email]> wrote:

> Hi,
>
> I'm working with Apache Flink 1.1.2 and testing on High Availability mode.
> In the case of Task Manager failures they say a standby TM will recover the
> work of the failed TM. In my case, I have 4 TM's running in parallel and
> when a TM is killed the state goes to Cancelling and then to Failed rather
> than Restarting and the work is not recovered.
>
> Is there a specific way to create standby TM's and a specific reason for
> jobs not being recovered?

F.Amara

Re: Task Manager recovery in Standalone Cluster High Availability mode

Hi,

Thanks a lot for the reply. I configured a restart strategy as suggested, and now the TM failure scenario works as expected. Once a TM is killed, another active TM automatically recovers the job.
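
For readers landing on this thread: the exact settings used above were not posted, so the following is only a sketch of what a cluster-wide fixed-delay restart strategy in flink-conf.yaml might look like. The key names follow the restart strategies page linked earlier in the thread; the values (3 attempts, 10 s delay) are illustrative assumptions, not the poster's configuration.

```yaml
# flink-conf.yaml (sketch; values are assumed, not taken from this thread)

# Restart the job after a failure instead of letting it go straight to Failed.
restart-strategy: fixed-delay

# Assumed value: give up after 3 restart attempts.
restart-strategy.fixed-delay.attempts: 3

# Assumed value: wait 10 seconds between attempts.
restart-strategy.fixed-delay.delay: 10 s
```

Note that in standalone mode the killed TM does not come back on its own, so a restart can presumably only succeed if the remaining TMs have enough free task slots for the job.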