Task Manager recovery in Standalone Cluster High Availability mode

F.Amara
Hi,

I'm working with Apache Flink 1.1.2 and testing High Availability mode. As I understand it, when a Task Manager fails, a standby TM is supposed to recover the work of the failed TM. In my case, I have 4 TMs running in parallel, and when a TM is killed the state goes to Cancelling and then to Failed rather than Restarting, and the work is not recovered.

Is there a specific way to create standby TMs, and is there a specific reason why the jobs are not being recovered?

Re: Task Manager recovery in Standalone Cluster High Availability mode

Ufuk Celebi
Hey! Did you configure a restart strategy?
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/restart_strategies.html

Keep in mind, though, that in standalone mode a TM process that has
exited won't be restarted automatically.

On Tue, Feb 21, 2017 at 10:00 AM, F.Amara <[hidden email]> wrote:

> Hi,
>
> I'm working with Apache Flink 1.1.2 and testing on High Availability mode.
> In the case of Task Manager failures they say a standby TM will recover the
> work of the failed TM. In my case, I have 4 TM's running in parallel and
> when a TM is killed the state goes to Cancelling and then to Failed rather
> than Restarting and the work is not recovered.
>
> Is there a specific way to create standby TM's and a specific reason for
> jobs not being recovered?
>
>
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Task-Manager-recovery-in-Standalone-Cluster-High-Availability-mode-tp11767.html
> Sent from the Apache Flink User Mailing List archive at Nabble.com.

Re: Task Manager recovery in Standalone Cluster High Availability mode

F.Amara
Hi,

Thanks a lot for the reply. I configured a restart strategy as suggested, and the TM failure scenario now works as expected: once a TM is killed, another active TM automatically recovers the job.
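
For readers who land on this thread, the behaviour discussed above can be illustrated with a small toy model. It is only a sketch and not Flink code: the class `FixedDelayRestart`, the function `job_state_after_tm_failure`, and the slot counts are hypothetical names and numbers chosen for illustration. The model assumes that a job with no restart strategy goes straight to FAILED when a TM is lost, and that a restart attempt succeeds only if the surviving TMs still offer enough free slots, since in standalone mode no replacement TM process is started automatically.

```python
from dataclasses import dataclass


@dataclass
class FixedDelayRestart:
    """Toy stand-in for a fixed-delay restart policy (hypothetical, not Flink's API)."""
    max_attempts: int
    attempts_used: int = 0

    def can_restart(self) -> bool:
        # A restart is allowed while attempts remain.
        return self.attempts_used < self.max_attempts

    def record_restart(self) -> None:
        self.attempts_used += 1


def job_state_after_tm_failure(required_slots, surviving_slots, strategy=None):
    """Return the modelled job state after one Task Manager is lost.

    required_slots:  slots the job needs in order to run (assumed value)
    surviving_slots: slots still offered by the remaining TMs (assumed value)
    strategy:        None models "no restart strategy configured"
    """
    if strategy is None:
        # Without a restart strategy the job is not retried.
        return "FAILED"
    while strategy.can_restart():
        strategy.record_restart()
        if surviving_slots >= required_slots:
            # The remaining TMs can host the whole job again.
            return "RUNNING"
    # Every attempt failed because no TM with free slots joined the cluster.
    return "FAILED"


if __name__ == "__main__":
    # Assumed setup: 4 TMs with 2 slots each, one TM killed, so 6 slots remain.
    print(job_state_after_tm_failure(4, 6))
    print(job_state_after_tm_failure(4, 6, FixedDelayRestart(max_attempts=3)))
    print(job_state_after_tm_failure(8, 6, FixedDelayRestart(max_attempts=3)))
```

Under these assumptions, the first call models the original problem (no restart strategy), the second models the fixed setup where the surviving TMs take over, and the third shows that a restart strategy alone does not help if the remaining TMs lack enough free slots.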