Hi,
I think the behaviour of min_pause_between_checkpoints is either buggy, or we should at least discuss whether the pause should also be respected after failed checkpoints. As far as I know, there is no ongoing work to add a backoff, so I suggest you open a Jira issue and make the case for it.
Best,
Stefan
Hello all,
Are there any recommendations on using a backoff when experiencing checkpointing failures?
What we have seen is that when a checkpoint expires, the next checkpoint does not take the previous failure into account and starts soon after. We experimented with
min_pause_between_checkpoints; however, that seems to apply only after successful checkpoints (the same is discussed on this
thread).
Are there any recommendations on how to get a backoff, or is there something in the works to add a backoff in case of checkpointing failures? This seems very valuable when checkpointing to an external location like S3, where one can potentially be throttled or get errors like TooBusyException from S3 (for example, as in this Jira).
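To make the request concrete, here is a rough sketch of the kind of backoff meant above. This is not an existing Flink feature or API: the class CheckpointBackoff and its methods are hypothetical names used only for illustration. It assumes the pause doubles after each consecutive failed checkpoint, is capped at a maximum, and resets after a successful checkpoint.

```java
/**
 * Hypothetical helper (not part of Flink): computes how long to wait
 * before triggering the next checkpoint, based on consecutive failures.
 */
final class CheckpointBackoff {
    private final long basePauseMillis;
    private final long maxPauseMillis;
    private int consecutiveFailures = 0;

    CheckpointBackoff(long basePauseMillis, long maxPauseMillis) {
        if (basePauseMillis <= 0 || maxPauseMillis < basePauseMillis) {
            throw new IllegalArgumentException(
                "require 0 < basePauseMillis <= maxPauseMillis");
        }
        this.basePauseMillis = basePauseMillis;
        this.maxPauseMillis = maxPauseMillis;
    }

    /** A successful checkpoint resets the backoff to the base pause. */
    void onCheckpointCompleted() {
        consecutiveFailures = 0;
    }

    /** A failed or expired checkpoint lengthens the next pause. */
    void onCheckpointFailed() {
        if (consecutiveFailures < Integer.MAX_VALUE) {
            consecutiveFailures++;
        }
    }

    /** Base pause doubled once per consecutive failure, capped at the maximum. */
    long nextPauseMillis() {
        long pause = basePauseMillis;
        for (int i = 0; i < consecutiveFailures && pause < maxPauseMillis; i++) {
            // Compare against max / 2 first so the doubling cannot overflow.
            pause = (pause > maxPauseMillis / 2) ? maxPauseMillis : pause * 2;
        }
        return pause;
    }
}
```

With a base pause of 1 s and a cap of 8 s, three consecutive failures would give pauses of 2 s, 4 s and 8 s, and a single successful checkpoint would bring the pause back to 1 s.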
Please let us know!
Thanks,
Vipul