Recovering from 1 of the nodes/slots of a Task Manager failing without resetting entire state during Recovery

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Recovering from 1 of the nodes/slots of a Task Manager failing without resetting entire state during Recovery

Vijay Balakrishnan
Hi,
I have been going through the book "Real time streaming with Apache Flink". 
How do I recover state for just a single node/slot in a TaskManager without having the recovery reset the application state for all the Task Managers ?

They mention the following:
Reset the state of the whole application to the latest checkpoint, i.e., resetting the state of each task. The JobManager provides the location to the most recent checkpoints of task state. Note that, depending on the topology of the application, certain optimizations are possible and not all tasks need to be reset.

TIA,
Vijay
Flink newbie.
Reply | Threaded
Open this post in threaded view
|

Re: Recovering from 1 of the nodes/slots of a Task Manager failing without resetting entire state during Recovery

Vijay Balakrishnan
HI,
I found the following documentation in the code:

flink-runtime: org.apache.flink.runtime.executiongraph.failover.RestartIndividualStrategy
Simple failover strategy that restarts each task individually.
 * This strategy is only applicable if the entire job consists unconnected
 * tasks, meaning each task is its own component.
 
 RestartPipelinedRegionStrategy:
 A failover strategy that restarts regions of the ExecutionGraph. A region is defined
  * by this strategy as the weakly connected component of tasks that communicate via pipelined
  * data exchange.


Can someone please explain to me the  IndividualStrategy (entire job consists unconnected tasks) & PipelinedRegionStrategy( weakly connected component of tasks that communicate via pipelined data exchange) with an example ?

TIA,
Vijay

On Tue, May 15, 2018 at 3:16 PM Vijay Balakrishnan <[hidden email]> wrote:
Hi,
I have been going through the book "Real time streaming with Apache Flink". 
How do I recover state for just a single node/slot in a TaskManager without having the recovery reset the application state for all the Task Managers ?

They mention the following:
Reset the state of the whole application to the latest checkpoint, i.e., resetting the state of each task. The JobManager provides the location to the most recent checkpoints of task state. Note that, depending on the topology of the application, certain optimizations are possible and not all tasks need to be reset.

TIA,
Vijay
Flink newbie.