Having a backoff while experiencing checkpointing failures

vipul singh
Hello all,

Are there any recommendations on using a backoff when experiencing checkpointing failures?
What we have seen is that when a checkpoint starts to expire, the next checkpoint doesn't take the previous failure into account and starts soon after. We experimented with min_pause_between_checkpoints, but that seems to apply only after successful checkpoints (the same is discussed on this thread).

Are there any recommendations on how to implement a backoff, or is there something in the works to add one in case of checkpointing failures? This seems very valuable when checkpointing to an external location like S3, where one can potentially be throttled or get errors like TooBusyException from S3 (for example, as in this Jira issue).
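
To make the request concrete, here is a minimal sketch of the kind of backoff meant here. It is not an existing Flink feature: the class `CheckpointBackoff` and its methods are hypothetical names, and the doubling policy with a cap is just one assumed choice. The idea is that the pause before the next checkpoint grows with each consecutive failure and resets after a success.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Flink): tracks consecutive checkpoint
 * failures and computes the pause to apply before the next checkpoint.
 */
final class CheckpointBackoff {
    private final long basePauseMillis;
    private final long maxPauseMillis;
    private int consecutiveFailures = 0;

    CheckpointBackoff(Duration basePause, Duration maxPause) {
        if (basePause.isZero() || basePause.isNegative()) {
            throw new IllegalArgumentException("basePause must be positive");
        }
        if (maxPause.compareTo(basePause) < 0) {
            throw new IllegalArgumentException("maxPause must be >= basePause");
        }
        this.basePauseMillis = basePause.toMillis();
        this.maxPauseMillis = maxPause.toMillis();
    }

    /** Call when a checkpoint completes: the backoff resets. */
    void onCheckpointCompleted() {
        consecutiveFailures = 0;
    }

    /** Call when a checkpoint expires or is declined. */
    void onCheckpointFailed() {
        if (consecutiveFailures < Integer.MAX_VALUE) {
            consecutiveFailures++;
        }
    }

    /** Pause before the next checkpoint: base * 2^failures, capped at max. */
    Duration nextPause() {
        long pause = basePauseMillis;
        for (int i = 0; i < consecutiveFailures && pause < maxPauseMillis; i++) {
            // Double the pause, guarding against overflow and the cap.
            pause = (pause > maxPauseMillis / 2) ? maxPauseMillis : pause * 2;
        }
        return Duration.ofMillis(pause);
    }
}
```

With a base pause of 5 seconds and a cap of 60 seconds, this would wait 5 s after a success, then 10 s, 20 s, 40 s and 60 s after one to four consecutive failures. Adding random jitter would likely help when many jobs are throttled by the same S3 bucket, but that is left out of the sketch.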

Please let us know!
Thanks,
Vipul

Re: Having a backoff while experiencing checkpointing failures

Stefan Richter
Hi,

I think the behaviour of min_pause_between_checkpoints is either buggy, or we should at least discuss whether it would be better to respect the pause for failed checkpoints as well. As far as I know, there is no ongoing work to add a backoff, so I suggest you open a Jira issue and make a case for this.

Best,
Stefan

In reply to vipul singh <[hidden email]>, 08.06.2018, 06:30
