flink checkpoints adjustment strategy

flink checkpoints adjustment strategy

Marco Villalobos-2
I am stuck on determining how long the checkpoint interval should be.

Is there a guide for that? If the checkpoint timeout is 10 minutes and checkpoints still time out, what is a good strategy for adjusting it?

What is a good starting point for the checkpoint interval, and how should it be adjusted from there?

We often see checkpoint errors during our onTimer calls. I don't know if that's related.

Marco A. Villalobos


Re: flink checkpoints adjustment strategy

Congxian Qiu
Hi Marco,
     You need to figure out why the checkpoint timed out. The UI shows the time consumed by each phase of a checkpoint. If the checkpoint really does need that long to complete, then you need to configure a longer timeout.
     If there are checkpoint errors, we first need to figure out what the problem is. In general, a checkpoint can be split into several parts: barrier alignment (backpressure or something else may keep some barriers from being received in time), sync duration (the thread is too busy ...), async duration (too much I/O or network processing ...), and so on.
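
As a rough illustration of the breakdown above, here is a minimal sketch that takes per-subtask phase durations copied by hand from the UI and reports which phase dominates. Everything in it is hypothetical: the `SubtaskCheckpoint` type, the `dominant_phase` and `report` helpers, and the sample numbers are not part of Flink, and the hints only restate the possible causes listed above.

```python
from dataclasses import dataclass

# Hints restate the causes named in this reply. They are starting points
# for investigation, not a diagnosis.
HINTS = {
    "alignment": "barriers arrive late; look for backpressure",
    "sync": "the task thread is too busy during the synchronous part",
    "async": "too much I/O or network work during the asynchronous part",
}


@dataclass
class SubtaskCheckpoint:
    """Phase durations in milliseconds, copied by hand from the UI (hypothetical type)."""
    name: str
    alignment_ms: int
    sync_ms: int
    async_ms: int


def dominant_phase(s: SubtaskCheckpoint) -> str:
    """Return the name of the phase with the largest duration."""
    phases = {"alignment": s.alignment_ms, "sync": s.sync_ms, "async": s.async_ms}
    return max(phases, key=phases.get)


def report(subtasks: list) -> list:
    """List subtasks from slowest to fastest, each with its dominant phase.

    Simplification: the three phase durations are summed as a stand-in for
    the total time a subtask spent on the checkpoint.
    """
    ordered = sorted(
        subtasks,
        key=lambda s: s.alignment_ms + s.sync_ms + s.async_ms,
        reverse=True,
    )
    return [f"{s.name}: {dominant_phase(s)} dominates ({HINTS[dominant_phase(s)]})"
            for s in ordered]


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = [
        SubtaskCheckpoint("source", alignment_ms=0, sync_ms=20, async_ms=900),
        SubtaskCheckpoint("timer-operator", alignment_ms=420_000, sync_ms=1_500, async_ms=30_000),
    ]
    for line in report(sample):
        print(line)
```

If one phase dominates across most slow subtasks, that is the place to look first. If every phase is moderate and the checkpoint simply takes long, a longer timeout is the more likely fix.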

Best,
Congxian


(In reply to the message from Marco Villalobos <[hidden email]>, sent Fri, Jan 29, 2021 at 7:19 AM.)


Re: flink checkpoints adjustment strategy

Marco Villalobos-2
Do you have advice on how to determine why a checkpoint failed? Timeouts are easy to discover because the UI logs them. Other errors are not so easy to find. Where should I look for those: in the UI, or in good old-fashioned logging?

(In reply to the message from Congxian Qiu <[hidden email]>, sent Fri, Jan 29, 2021 at 3:11 AM.)