With the same checkpoint interval, checkpoints on Flink 1.12 take longer and eventually fail in production, while the same job runs normally on Flink 1.9


Haihang Jing
[Symptom] For jobs with the same configuration (checkpoint interval: 3
minutes, job logic: a regular join), Flink 1.9 runs normally. On Flink 1.12,
after the job has been running for a while, the checkpoint creation time
keeps increasing until checkpoint creation finally fails.

[Analysis] We learned that Flink 1.10 adjusted the checkpoint mechanism:
during barrier alignment the receiver no longer buffers the data that
arrives after a barrier, which means the sender must wait for credit
feedback before it can transmit the data behind the barrier. The sender
therefore goes through a certain "cold start" after each alignment, which
affects latency and network throughput. We raised the checkpoint interval
to 10 minutes for a comparative test and found that, with the 10-minute
interval, the job runs normally on Flink 1.12.

Issue: https://issues.apache.org/jira/browse/FLINK-16404
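
For reference, the checkpointing setup of the job looks roughly like the
sketch below (standard Flink DataStream API; the 3-minute interval is the
one under discussion, while the timeout and pause values are illustrative,
not our production settings):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Trigger a checkpoint every 3 minutes (the interval under discussion).
    env.enableCheckpointing(3 * 60 * 1000L);
    // Illustrative values: allow each checkpoint up to 10 minutes before it
    // is declared failed, and keep a short pause between checkpoints.
    env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30 * 1000L);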

[Questions] 1. Has anyone else encountered the same problem?
            2. Can Flink 1.12 still run with a small checkpoint interval?

With the 3-minute checkpoint interval, checkpoint creation fails after the
Flink 1.12 job has run for about 5 hours. The specific exception stack:

org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
        at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:96)
        at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:65)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1924)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1897)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:93)
        at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2038)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
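
The exception itself comes from the tolerable-failure check: the job is
failed once more consecutive checkpoints fail than
execution.checkpointing.tolerable-failed-checkpoints allows (the default is
0). Raising the threshold only postpones the failure instead of fixing the
slow checkpoints, but for completeness, continuing the sketch above with an
illustrative value:

    // Tolerate up to 3 failed checkpoints before failing the whole job.
    env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);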





Re: With the same checkpoint interval, checkpoints on Flink 1.12 take longer and eventually fail in production, while the same job runs normally on Flink 1.9

Congxian Qiu
Hi,
    From the description, the time used to complete a checkpoint on 1.12 is longer. Could you share more detail about the time consumption when running the job on 1.9 and on 1.12?
Best,
Congxian



Re: With the same checkpoint interval, checkpoints on Flink 1.12 take longer and eventually fail in production, while the same job runs normally on Flink 1.9

Haihang Jing
Hi Congxian, thanks for your reply.
Job run on Flink 1.9 (checkpoint interval 3 min):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/6.png>
Job run on Flink 1.12 (checkpoint interval 10 min):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/7.png>
Job run on Flink 1.12 (checkpoint interval 3 min):
Pic 1: Time to complete a checkpoint on 1.12 is longer (5m32s):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/2.png>
Pic 2: Start delay (4m27s):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/1.png>
Pic 3: The next checkpoint failed (task 141 ack n/a):
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/3.png>
Pic 4: No back pressure or data skew observed:
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/4.png>
Pic 5: Subtasks process the same number of records, and the end-to-end checkpoint time is short:
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t3050/5.png>
Best,
Haihang




Re: With the same checkpoint interval, checkpoints on Flink 1.12 take longer and eventually fail in production, while the same job runs normally on Flink 1.9

Yingjie Cao
Hi Haihang,

I think your issue is not related to FLINK-16404, because that change should have only a small impact on checkpoint time; we already have a micro-benchmark for that change (with a 1 s checkpoint interval) and no regression was seen.

Could you share some more information, for example, the stack of the task which cannot finish the checkpoint?
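
The quickest way is usually jstack <taskmanager-pid> on the affected
TaskManager; alternatively the dump can be taken programmatically. A
minimal, JDK-only sketch (not Flink-specific):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;

    public class Stacks {
        public static void main(String[] args) {
            // Print the stack of every live thread in this JVM; run inside
            // the TaskManager JVM to see what the stuck task is doing.
            for (ThreadInfo info :
                    ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
                System.out.println(info.getThreadName() + " (" + info.getThreadState() + ")");
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("\tat " + frame);
                }
            }
        }
    }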

Best,
Yingjie


Re: With the same checkpoint interval, checkpoints on Flink 1.12 take longer and eventually fail in production, while the same job runs normally on Flink 1.9

Yingjie Cao
Hi Haihang,

After scanning the user mailing list, I found that some users have reported checkpoint timeouts when using unaligned checkpoints. Can you share which checkpoint mode you use? (The information can be found in the log or in the Checkpoints -> Configuration tab of the web UI.)
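
For reference, unaligned checkpoints are opt-in on 1.12: they are only
active if the job sets execution.checkpointing.unaligned: true in the
configuration or enables them in code; otherwise the mode is the default
aligned one. A minimal sketch of the code path:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Opt in to unaligned checkpoints; without this call (or the config
    // key above) the job runs with aligned, exactly-once checkpoints.
    env.getCheckpointConfig().enableUnalignedCheckpoints();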

Best,
Yingjie
