Hi
We are using the fixed delay restart strategy. I have a fundamental question: why is the restart counter not reset to zero after a streaming job restarts successfully?

Say the maximum number of restarts is 5. My streaming job failed twice and the third attempt succeeded. Why does the counter stay at 2 instead of going back to zero? Traditionally, in the networking world, clients retry for some time and reset the counter to zero once they succeed.

Why does Flink behave differently?

Regards
Bhaskar
Hi Bhaskar,
> Why the reset counter is not zero after streaming job restart is successful?

The short answer is that the fixed delay restart strategy is not implemented like that (see [1] if you are using Flink 1.10 or above). There are also other systems that behave similarly, e.g., Apache Hadoop YARN (see yarn.resourcemanager.am.max-attempts).

If you have such a requirement, you can try to approximate it using the failure rate restart strategy [2]. Resetting the attempt counter to zero after a successful restart cannot be easily implemented with the current RestartBackoffTimeStrategy interface [3]; for this to be possible, the strategy would need to be informed if a restart was successful. However, it is not clear what constitutes a successful restart. For example, is it sufficient that enough TMs/slots could be acquired to run the job? The job could still fail afterwards due to a bug in user code. Could it be sufficient to require all tasks to produce at least one record? I do not think so because the job could still fail deterministically afterwards due to a particular record.

Best,
Gary

[1] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FixedDelayRestartBackoffTimeStrategy.java#L29
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/task_failure_recovery.html#failure-rate-restart-strategy
[3] https://github.com/apache/flink/blob/d1292b5f30508e155d0f733527532d7c671ad263/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/RestartBackoffTimeStrategy.java#L23

On Tue, May 26, 2020 at 9:28 AM Vijay Bhaskar <[hidden email]> wrote:
>
> Hi
> We are using restart strategy of fixed delay.
> I have fundamental question:
> Why the reset counter is not zero after streaming job restart is successful?
> Let's say I have number of restarts max are: 5
> My streaming job tried 2 times and 3'rd attempt its successful, why counter is still 2 but not zero?
> Traditionally in network world, clients will retry for some time and once they are successful, they will reset the counter back to zero.
>
> Why this is the case in flink?
>
> Regards
> Bhaskar
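
For readers comparing the two strategies discussed in this thread, here is a minimal, self-contained Java sketch of the difference in counting. It is not Flink's implementation: the class names `FixedDelayCounter` and `FailureRateCounter`, the method `onFailure`, and the exact window semantics are hypothetical and chosen only for illustration. The first counter never forgets a failure, which is the behaviour asked about; the second only counts failures inside a sliding time window, so old failures stop counting without anyone having to define what a "successful restart" is.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RestartCounterSketch {

    /** Hypothetical sketch: every failure counts for the lifetime of the job. */
    static final class FixedDelayCounter {
        private final int maxAttempts;
        private int attempts;

        FixedDelayCounter(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        /** Records a failure and returns true if the job may be restarted. */
        boolean onFailure() {
            attempts++;
            return attempts <= maxAttempts;
        }
    }

    /** Hypothetical sketch: only failures inside the sliding window count. */
    static final class FailureRateCounter {
        private final int maxFailuresPerInterval;
        private final long intervalMillis;
        private final Deque<Long> failureTimestamps = new ArrayDeque<>();

        FailureRateCounter(int maxFailuresPerInterval, long intervalMillis) {
            this.maxFailuresPerInterval = maxFailuresPerInterval;
            this.intervalMillis = intervalMillis;
        }

        /** Records a failure at nowMillis and returns true if the job may be restarted. */
        boolean onFailure(long nowMillis) {
            failureTimestamps.addLast(nowMillis);
            // Forget failures that have fallen out of the window.
            while (nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
                failureTimestamps.removeFirst();
            }
            return failureTimestamps.size() <= maxFailuresPerInterval;
        }
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        long tenMinutes = 10L * 60L * 1000L;

        FixedDelayCounter fixed = new FixedDelayCounter(5);
        FailureRateCounter rate = new FailureRateCounter(5, tenMinutes);

        boolean fixedMayRestart = true;
        boolean rateMayRestart = true;
        // Six failures, one hour apart, with the job running fine in between.
        for (int i = 0; i < 6; i++) {
            fixedMayRestart = fixed.onFailure();
            rateMayRestart = rate.onFailure(i * hour);
        }
        System.out.println("fixed delay, 6th failure may restart: " + fixedMayRestart);
        System.out.println("failure rate, 6th failure may restart: " + rateMayRestart);
    }
}
```

With a maximum of 5, the fixed delay sketch refuses the sixth restart even though the failures are an hour apart, while the windowed sketch still allows it. A burst of six failures inside one window would be refused by both.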