Best practices around checkpoint intervals and sizes?

classic Classic list List threaded Threaded
2 messages Options
Dan
Reply | Threaded
Open this post in threaded view
|

Best practices around checkpoint intervals and sizes?

Dan
Hi.  I'm playing around with optimizing our checkpoint intervals and sizes.

Are there any best practices around this?  I have a ~7 sequential joins and a few sinks.  I'm curious what would result in the better throughput and latency trade offs.  I'd assume less frequent checkpointing would increase throughput (but constrained by how frequently I want checkedpointed sinks written).
Reply | Threaded
Open this post in threaded view
|

Re: Best practices around checkpoint intervals and sizes?

Chesnay Schepler
A lower checkpoint interval (== more checkpoints / time) will consume
more resources and hence can affect the job performance.
It ultimately boils down to how much latency you are willing to accept
when a failure occurs and data has to be re-processed (more checkpoints
=> less data).

How long this catch-up takes depends on the job and provisioning of the
cluster. An over-provisioned cluster can recover quicker from what is
ultimately just a data spike, while one that is barely keeping up may
incur significant latency.

We know that many users have a checkpointing interval of the order of
minutes, but at the end of the day you will need to run some experiments
with your job&cluster&data to get some rough numbers.

On 2/18/2021 7:35 AM, Dan Hill wrote:
> Hi.  I'm playing around with optimizing our checkpoint intervals and
> sizes.
>
> Are there any best practices around this?  I have a ~7 sequential
> joins and a few sinks.  I'm curious what would result in the better
> throughput and latency trade offs.  I'd assume less frequent
> checkpointing would increase throughput (but constrained by how
> frequently I want checkedpointed sinks written).