Task Manager high availability

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Task Manager high availability

gcandal
Hi,

Before the actual question, and just to make sure my assumptions are correct, my understanding is that Flink's current behaviour is:

- Job Manager failure under an highly available setup: a standby Job Manager takes over, no impact on the job
- Task Manager failure, with enough free slots in the cluster: as long there is a restart strategy defined, the failed sub-tasks will be resumed in another machine
- Task Manager failure, without free slots in the cluster: sub-tasks in the faulty node are marked as failed, while the other ones transition to cancelled. The job is marked as failing. When new task slots appear in the cluster, the job recovers as long as there is a restart strategy

Assuming this is true, there is no true high availability in the sense that there's always some down time when a node fails, even in unrelated partitions or parallel sub-tasks. Is this correct? If so, what is the recommended strategy for achieving high availability in case of a task manager failure?

Currently the only option I can see is to duplicate the entire job, in an active-active fashion. However, if we also try to be resilient to data center failures, that would mean having something like 4 instances of the same job, which starts to seem like a waste of resources. It can be argued that the standby jobs don't need to run in a cluster as powerful as the original one, and that the secondary datacenter only needs a single cluster, but still. Are there any other alternatives?

Ignoring the current state of affairs, are there any plans to tackle this issue? I recall seeing something about "hot-standby replication" in Aljoscha Krettek's "(Past), Present, and Future of Apache Flink" talk during last year's QCon.ai. I've tried scavenging information about this from JIRA and design docs, but to no success.

Besides this, and I'm not sure this is the right place to ask, is there a commercial solution to this problem?

Thanks in advance!