(DEPRECATED) Apache Flink User Mailing List archive.

Question on Job Restart strategy

Vijay Bhaskar

Question on Job Restart strategy

We are using the fixed delay restart strategy.

I have a fundamental question:

Why is the restart counter not reset to zero after a streaming job restarts successfully?

Let's say the maximum number of restart attempts is 5.

My streaming job failed 2 times and the 3rd attempt was successful. Why is the counter still 2 and not zero?

Traditionally, in the networking world, clients retry for some time and, once they succeed, reset the counter back to zero.

Why does Flink behave this way?

Regards

Bhaskar

Gary Yao-5

Re: Question on Job Restart strategy

Hi Bhaskar,

> Why the reset counter is not zero after streaming job restart is successful?

The short answer is that the fixed delay restart strategy is not
implemented like that (see [1] if you are using Flink 1.10 or above).
There are also other systems that behave similarly, e.g., Apache
Hadoop YARN (see yarn.resourcemanager.am.max-attempts).
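
To illustrate the behavior described above, here is a minimal sketch of a restart counter that only ever counts up. It is a simplified model, not Flink's actual class: the name FixedDelayCounterSketch and its fields and methods are hypothetical, and the restart delay itself is left out.

```java
// Simplified model of a fixed-delay style restart counter.
// Hypothetical names; this is NOT Flink's implementation.
final class FixedDelayCounterSketch {
    private final int maxAttempts; // configured maximum number of restarts
    private int attempts = 0;      // failures seen over the job's whole lifetime

    FixedDelayCounterSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    // Called on every failure. Nothing ever decrements or resets the counter,
    // so a long healthy run between failures makes no difference.
    void notifyFailure() {
        attempts++;
    }

    // A restart is allowed while the failure count has not exceeded the maximum.
    boolean canRestart() {
        return attempts <= maxAttempts;
    }

    int attempts() {
        return attempts;
    }

    public static void main(String[] args) {
        FixedDelayCounterSketch strategy = new FixedDelayCounterSketch(5);
        strategy.notifyFailure(); // first failure
        strategy.notifyFailure(); // second failure, third run succeeds
        System.out.println("attempts after recovery: " + strategy.attempts());

        // Four more failures later in the job's life exhaust the budget.
        for (int i = 0; i < 4; i++) {
            strategy.notifyFailure();
        }
        System.out.println("can restart: " + strategy.canRestart());
    }
}
```

In this model the two early failures still count against the budget of 5, which matches the scenario in the question: the counter stays at 2 after the third attempt succeeds.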

If you have such a requirement, you can try to approximate it using
the failure rate restart strategy [2]. Resetting the attempt counter
to zero after a successful restart cannot be easily implemented with
the current RestartBackoffTimeStrategy interface [3]; for this to be
possible, the strategy would need to be informed if a restart was
successful. However, it is not clear what constitutes a successful
restart. For example, is it sufficient that enough TaskManagers (TMs)/slots could be
acquired to run the job? The job could still fail afterwards due to a
bug in user code. Could it be sufficient to require all tasks to
produce at least one record? I do not think so because the job could
still fail deterministically afterwards due to a particular record.
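
As a rough sketch of why a failure rate strategy can approximate a counter reset: it only considers failures inside a recent time window, so failures that happened long ago stop counting without the strategy ever needing to know whether a restart "succeeded". The class below is a simplified model with hypothetical names (FailureRateWindowSketch, maxFailuresPerInterval, intervalMillis); the exact boundary conditions in Flink's own implementation may differ.

```java
import java.util.ArrayDeque;

// Simplified sliding-window model of a failure-rate restart policy.
// Hypothetical names; this is NOT Flink's implementation.
final class FailureRateWindowSketch {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final ArrayDeque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateWindowSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    // Record the time of a failure (timestamps are passed in to keep the sketch testable).
    void notifyFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
    }

    // Forget failures older than the interval, then compare the remainder to the limit.
    boolean canRestart(long nowMillis) {
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // Allow at most 2 failures per 60 seconds.
        FailureRateWindowSketch strategy = new FailureRateWindowSketch(2, 60_000L);
        strategy.notifyFailure(0L);
        strategy.notifyFailure(1_000L);
        System.out.println("after 2 quick failures: " + strategy.canRestart(1_000L));

        // The job then runs for two minutes before failing again.
        // The two old failures have aged out of the window.
        strategy.notifyFailure(120_000L);
        System.out.println("after a late failure: " + strategy.canRestart(120_000L));
    }
}
```

The window length takes over the role of the "successful restart" signal: a job that stays up longer than the interval effectively starts again with a clean slate, while a job that fails repeatedly in quick succession still gives up.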

Best,
Gary

[1] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FixedDelayRestartBackoffTimeStrategy.java#L29
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/task_failure_recovery.html#failure-rate-restart-strategy
[3] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartBackoffTimeStrategy.java#L23

On Tue, May 26, 2020 at 9:28 AM Vijay Bhaskar <[hidden email]> wrote:

>
> Hi
> We are using restart strategy of fixed delay.
> I have fundamental question:
> Why the reset counter is not zero after streaming job restart is successful?
> Let's say I have number of restarts max are: 5
> My streaming job tried 2 times and 3'rd attempt its successful, why counter is still 2 but not zero?
> Traditionally in network world, clients will retry for some time and once they are successful, they will reset the counter back to zero.
>
> Why this is the case in flink?
>
> Regards
> Bhaskar