Question on Job Restart strategy

Vijay Bhaskar
Hi
We are using the fixed delay restart strategy.
I have a fundamental question:
Why is the restart counter not reset to zero after a streaming job restarts successfully?
Let's say the maximum number of restarts is 5.
My streaming job failed twice and the third attempt succeeded. Why is the counter still 2 rather than zero?
Traditionally, in the networking world, clients retry for some time and, once they succeed, reset the counter back to zero.

Why does Flink behave this way?

Regards
Bhaskar

Re: Question on Job Restart strategy

Gary Yao-5
Hi Bhaskar,

> Why is the restart counter not reset to zero after a streaming job restarts successfully?

The short answer is that the fixed delay restart strategy is not
implemented like that (see [1] if you are using Flink 1.10 or above).
There are also other systems that behave similarly, e.g., Apache
Hadoop YARN (see yarn.resourcemanager.am.max-attempts).
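
To make the behavior concrete, here is a minimal sketch of a restart counter that only ever counts up. It is a simplified model written for illustration, not the code from [1]; the class FixedDelayCounterSketch and its methods are hypothetical names, and the fixed delay itself is left out because it does not affect the counting.

```java
// Simplified, hypothetical model of a fixed-delay style restart counter.
// It is NOT Flink's class; it only illustrates "count failures, never reset".
class FixedDelayCounterSketch {
    private final int maxAttempts;   // assumed: configured maximum number of restarts
    private int currentAttempt = 0;  // grows with every failure, for the whole job lifetime

    FixedDelayCounterSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Called on every failure. Nothing is ever called on success,
    // so there is no point at which the counter could be reset.
    void notifyFailure() {
        currentAttempt++;
    }

    // A restart is allowed as long as the budget is not exceeded.
    boolean canRestart() {
        return currentAttempt <= maxAttempts;
    }

    int attemptsUsed() {
        return currentAttempt;
    }

    public static void main(String[] args) {
        FixedDelayCounterSketch strategy = new FixedDelayCounterSketch(5);
        strategy.notifyFailure(); // first failure
        strategy.notifyFailure(); // second failure
        // The third attempt runs fine, but the strategy is not told about it.
        System.out.println(strategy.attemptsUsed() + " " + strategy.canRestart()); // prints "2 true"
    }
}
```

In this model, the two early failures stay on the books: with a maximum of 5, only three more failures are tolerated over the rest of the job's lifetime, no matter how long the job ran successfully in between.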

If you have such a requirement, you can try to approximate it using
the failure rate restart strategy [2]. Resetting the attempt counter
to zero after a successful restart cannot be easily implemented with
the current RestartBackoffTimeStrategy interface [3]; for this to be
possible, the strategy would need to be informed if a restart was
successful. However, it is not clear what constitutes a successful
restart. For example, is it sufficient that enough TMs/slots could be
acquired to run the job? The job could still fail afterwards due to a
bug in user code. Could it be sufficient to require all tasks to
produce at least one record? I do not think so because the job could
still fail deterministically afterwards due to a particular record.
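
As a rough illustration of why the failure rate strategy approximates a reset, the sketch below counts only the failures inside a sliding time window, so old failures stop counting once the job has run long enough without failing. This is a simplified model, not Flink's implementation; the class FailureRateSketch, its method, and the explicit timestamps are assumptions made to keep the example deterministic. In Flink, the corresponding settings are the maximum failures per interval, the failure rate interval, and the delay described in [2].

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified, hypothetical model of a failure-rate check with a sliding window.
// It is NOT Flink's class; timestamps are passed in explicitly for determinism.
class FailureRateSketch {
    private final int maxFailuresPerInterval; // assumed: allowed failures per window
    private final long intervalMs;            // assumed: window length in milliseconds
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMs) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMs = intervalMs;
    }

    // Records a failure at nowMs and reports whether a restart is still allowed.
    boolean notifyFailureAndCheck(long nowMs) {
        // Forget failures that fell out of the window; this is the "implicit reset".
        while (!failureTimestamps.isEmpty()
                && nowMs - failureTimestamps.peekFirst() > intervalMs) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMs);
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        FailureRateSketch strategy = new FailureRateSketch(2, 1_000L);
        System.out.println(strategy.notifyFailureAndCheck(0L));     // true: 1 failure in window
        System.out.println(strategy.notifyFailureAndCheck(100L));   // true: 2 failures in window
        System.out.println(strategy.notifyFailureAndCheck(200L));   // false: 3 failures in window
        System.out.println(strategy.notifyFailureAndCheck(5_000L)); // true: old failures expired
    }
}
```

Note that this sidesteps the question of what a "successful restart" is: the strategy never needs to be told about success, it only needs enough failure-free time to pass.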

Best,
Gary

[1] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FixedDelayRestartBackoffTimeStrategy.java#L29
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/task_failure_recovery.html#failure-rate-restart-strategy
[3] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartBackoffTimeStrategy.java#L23


(In reply to Vijay Bhaskar's message of Tue, May 26, 2020, 9:28 AM.)