Dynamic configuration of Flink checkpoint interval

Dynamic configuration of Flink checkpoint interval

Kai Fu
Hi team,

We would like to know whether Flink supports changing the checkpoint interval dynamically. Our use case has a cold-start phase in which the entire dataset is replayed from the beginning up to the most recent records.

In the cold-start phase, resources are fully utilized and backpressure is high for all upstream operators, so checkpoints constantly time out. Real production traffic is far lower than that, and the currently provisioned resources can handle it.

We are wondering whether Flink could support a dynamic checkpoint configuration that skips checkpointing, or makes it less frequent, during the cold-start phase to speed up the process, and then restores the normal checkpoint interval once the cold start has completed.

--
Best wishes,
- Kai
Re: Dynamic configuration of Flink checkpoint interval

Kai Fu
Hi Jing,

Yup, what you're describing is what I want. I also tried the approach you suggested and it works. I'm going to take that approach for the moment and create a Jira issue for this feature.

On Sun, May 30, 2021 at 8:57 PM JING ZHANG <[hidden email]> wrote:
Hi Kai,

Are you looking for a way to hot-update the checkpoint interval, or to disable/enable checkpointing, without stopping and restarting the job?
Unfortunately, that is not supported yet, AFAIK.
You're very welcome to create an issue and describe your needs here (Flink's Jira).
For now, you could use the following temporary workaround:
  1. set a larger value as the checkpoint interval and start your job
  2. take a savepoint after the cold start has completed
  3. set a normal value as the checkpoint interval and restart the job from the savepoint

Best regards,
JING ZHANG
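
For readers trying the workaround above, one way to avoid rebuilding the job between the cold-start run and the normal run is to read the checkpoint interval from a program argument. The following is only a minimal sketch under stated assumptions: the flag name `--checkpoint-interval-ms`, the one-minute default, and the class name are hypothetical and not part of Flink. The Flink call it would feed, `StreamExecutionEnvironment#enableCheckpointing(long)`, is shown only in a comment so that the snippet runs without Flink on the classpath.

```java
import java.time.Duration;

class CheckpointIntervalArgs {
    // Hypothetical flag name chosen for this sketch; it is not a Flink option.
    public static final String FLAG = "--checkpoint-interval-ms";
    // Assumed "normal" interval, used when the flag is absent.
    public static final Duration DEFAULT_INTERVAL = Duration.ofMinutes(1);

    /** Returns the interval given after FLAG, or DEFAULT_INTERVAL if the flag is missing. */
    public static Duration checkpointInterval(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (FLAG.equals(args[i])) {
                long millis = Long.parseLong(args[i + 1]);
                if (millis <= 0) {
                    throw new IllegalArgumentException(FLAG + " must be positive, got " + millis);
                }
                return Duration.ofMillis(millis);
            }
        }
        return DEFAULT_INTERVAL;
    }

    public static void main(String[] args) {
        Duration interval = checkpointInterval(args);
        System.out.println("checkpoint interval (ms): " + interval.toMillis());
        // In the real job the value would be handed to Flink, for example:
        //   StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //   env.enableCheckpointing(interval.toMillis());
    }
}
```

With something like this in place, the cold-start submission would pass a large value (for example `--checkpoint-interval-ms 3600000`), a savepoint would be taken with `flink savepoint <jobId>` once the replay has caught up, and the job would be resubmitted with `flink run -s <savepointPath>` without the flag so that the default interval applies.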

--
Best wishes,
- Kai
Re: Dynamic configuration of Flink checkpoint interval

JING ZHANG
Hi Kai,

Happy to hear that. 
Could you please post the Jira link in this thread after you create it? It may help other users who run into the same problem. Thanks very much.

Best regards,
JING ZHANG

Re: Dynamic configuration of Flink checkpoint interval

Senhong Liu
Hi all,

In fact, a very similar Jira issue already exists, https://issues.apache.org/jira/browse/FLINK-18578, and I am working on it. I will publish a FLIP and start a discussion about it in the near future. We look forward to your participation.

Best,
Senhong Liu

Re: Dynamic configuration of Flink checkpoint interval

Kai Fu
In reply to this post by JING ZHANG
Hi JING,


--
Best wishes,
- Kai

Re: Dynamic configuration of Flink checkpoint interval

Yun Tang
In reply to this post by Senhong Liu
Hi Kai,

I think unaligned checkpoints combined with an alignment timeout [1] might also help you in this case. You could use unaligned checkpoints to reduce the checkpoint duration.



Best
Yun Tang

