(DEPRECATED) Apache Flink User Mailing List archive.

Best practices around checkpoint intervals and sizes?

Classic

List

Threaded

2 messages Options

Dan

Best practices around checkpoint intervals and sizes?

Hi. I'm playing around with optimizing our checkpoint intervals and sizes.

Are there any best practices around this? I have a ~7 sequential joins and a few sinks. I'm curious what would result in the better throughput and latency trade offs. I'd assume less frequent checkpointing would increase throughput (but constrained by how frequently I want checkedpointed sinks written).

Chesnay Schepler

Re: Best practices around checkpoint intervals and sizes?

A lower checkpoint interval (== more checkpoints / time) will consume
more resources and hence can affect the job performance.
It ultimately boils down to how much latency you are willing to accept
when a failure occurs and data has to be re-processed (more checkpoints
=> less data).

How long this catch-up takes depends on the job and provisioning of the
cluster. An over-provisioned cluster can recover quicker from what is
ultimately just a data spike, while one that is barely keeping up may
incur significant latency.

We know that many users have a checkpointing interval of the order of
minutes, but at the end of the day you will need to run some experiments
with your job&cluster&data to get some rough numbers.

On 2/18/2021 7:35 AM, Dan Hill wrote:
> Hi. I'm playing around with optimizing our checkpoint intervals and
> sizes.
>
> Are there any best practices around this? I have a ~7 sequential
> joins and a few sinks. I'm curious what would result in the better
> throughput and latency trade offs. I'd assume less frequent
> checkpointing would increase throughput (but constrained by how
> frequently I want checkedpointed sinks written).