Re: With the checkpoint interval of the same size, the Flink 1.12 version of the job checkpoint time-consuming increase and production failure, the Flink1.9 job is running normally

Posted by Yingjie Cao on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/With-the-checkpoint-interval-of-the-same-size-the-Flink-1-12-version-of-the-job-checkpoint-time-consy-tp42471p42628.html

Hi Haihang,

I don't think your issue is related to FLINK-16404, because that change should have only a small impact on checkpoint time. We already have a micro-benchmark for that change (1s checkpoint interval), and no regression was seen.

Could you share some more information, for example, the stack trace of the task that cannot finish the checkpoint?
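
In case it helps, here is a minimal sketch of what such a stack capture looks like using only the standard JDK API. The class name `ThreadDumpSketch` and the name filter are illustrative assumptions, not part of Flink. In practice, running the JDK tool `jstack <pid>` against the TaskManager process is probably simpler, since this code only sees threads of the JVM it runs in.

```java
import java.util.Map;

// Hypothetical helper (not a Flink class): prints the stacks of live threads
// in the current JVM whose names contain a given substring.
class ThreadDumpSketch {

    // Returns a text dump of every live thread whose name contains nameFilter.
    // An empty filter matches all threads.
    static String dump(String nameFilter) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.getName().contains(nameFilter)) {
                continue;
            }
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Optional first argument: substring of the thread name to look for.
        System.out.print(dump(args.length > 0 ? args[0] : ""));
    }
}
```

Taking two or three dumps a few seconds apart usually makes it easier to tell whether the task thread is stuck in one place or still making progress.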

Best,
Yingjie

Haihang Jing <[hidden email]> wrote on Thursday, March 25, 2021 at 10:58 AM:
Hi Congxian, thanks for your reply.
Job running on Flink 1.9 (checkpoint interval 3 min):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/6.png>
Job running on Flink 1.12 (checkpoint interval 10 min):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/7.png>
Job running on Flink 1.12 (checkpoint interval 3 min):
Pic 1: The checkpoint takes longer to complete in 1.12 (5m32s):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/2.png>
Pic 2: Start delay (4m27s):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/1.png>
Pic 3: The next checkpoint failed (task 141 ack n/a):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/3.png>
Pic 4: No back pressure or data skew observed:
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/4.png>
Pic 5: A subtask that processes the same amount of data completes its checkpoint quickly (end-to-end duration):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/5.png>
Best,
Haihang
