Batch Processing Fault Tolerance (DataSet API)

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Batch Processing Fault Tolerance (DataSet API)

Ovidiu-Cristian MARCU
Hi

In case of failure of a node what does it mean 'Fault tolerance for programs in the DataSet API works by retrying failed executions’ [1] ?
-work already done by the rest of the nodes is not lost, only work of the lost node is recomputed, job execution will continue
or
-entire job execution is retried


Best,
Ovidiu 
Reply | Threaded
Open this post in threaded view
|

Re: Batch Processing Fault Tolerance (DataSet API)

Till Rohrmann
Hi Ovidiu,

at the moment Flink's batch fault tolerance restarts the whole job in case of a failure. However, parts of the logic to do partial backtracking such as intermediate result partitions and the backtracking algorithm are already implemented or exist as a PR [1]. So we hope to complete the partial backtracking soon.


Cheers,
Till

On Mon, Feb 22, 2016 at 6:00 PM, Ovidiu-Cristian MARCU <[hidden email]> wrote:
Hi

In case of failure of a node what does it mean 'Fault tolerance for programs in the DataSet API works by retrying failed executions’ [1] ?
-work already done by the rest of the nodes is not lost, only work of the lost node is recomputed, job execution will continue
or
-entire job execution is retried


Best,
Ovidiu 

Reply | Threaded
Open this post in threaded view
|

Re: Batch Processing Fault Tolerance (DataSet API)

Ovidiu-Cristian MARCU
Thank you, Till!

The current (in progress) implementation is considering also the problem related to losing the task's slots of the failed node(s), something related to [2] ?


Best,
Ovidiu

On 22 Feb 2016, at 18:13, Till Rohrmann <[hidden email]> wrote:

Hi Ovidiu,

at the moment Flink's batch fault tolerance restarts the whole job in case of a failure. However, parts of the logic to do partial backtracking such as intermediate result partitions and the backtracking algorithm are already implemented or exist as a PR [1]. So we hope to complete the partial backtracking soon.


Cheers,
Till

On Mon, Feb 22, 2016 at 6:00 PM, Ovidiu-Cristian MARCU <[hidden email]> wrote:
Hi

In case of failure of a node what does it mean 'Fault tolerance for programs in the DataSet API works by retrying failed executions’ [1] ?
-work already done by the rest of the nodes is not lost, only work of the lost node is recomputed, job execution will continue
or
-entire job execution is retried


Best,
Ovidiu 


Reply | Threaded
Open this post in threaded view
|

Re: Batch Processing Fault Tolerance (DataSet API)

Till Rohrmann
At the moment, the system can only deal with lost slots (nodes) if either there are some excess slots which have not been used before or if the died node is restarted. The latter is the case for yarn applications, for example. There the application master will restart containers which have died.

In the future it is also conceivable to adjust the parallelism of the job to the current number of available slots. However, this is not yet implemented. For the streaming part, we're currently working in this direction. It is subsumed by the topic of dynamic scaling in/out.

Cheers,
Till

On Mon, Feb 22, 2016 at 6:33 PM, Ovidiu-Cristian MARCU <[hidden email]> wrote:
Thank you, Till!

The current (in progress) implementation is considering also the problem related to losing the task's slots of the failed node(s), something related to [2] ?


Best,
Ovidiu

On 22 Feb 2016, at 18:13, Till Rohrmann <[hidden email]> wrote:

Hi Ovidiu,

at the moment Flink's batch fault tolerance restarts the whole job in case of a failure. However, parts of the logic to do partial backtracking such as intermediate result partitions and the backtracking algorithm are already implemented or exist as a PR [1]. So we hope to complete the partial backtracking soon.


Cheers,
Till

On Mon, Feb 22, 2016 at 6:00 PM, Ovidiu-Cristian MARCU <[hidden email]> wrote:
Hi

In case of failure of a node what does it mean 'Fault tolerance for programs in the DataSet API works by retrying failed executions’ [1] ?
-work already done by the rest of the nodes is not lost, only work of the lost node is recomputed, job execution will continue
or
-entire job execution is retried


Best,
Ovidiu