http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Making-job-fail-on-Checkpoint-Expired-tp34051p34145.html
Hi Robin
Thanks for the detailed reply, and sorry for my late reply.
I think your request to fail the whole job when checkpoints keep expiring is valid; I've created an issue to track this [1].
For now, the following steps may help you find the cause of the timeout:
1. In the checkpoint UI, find the subtask that has not acknowledged the checkpoint (call it A).
2. Check whether A is currently under backpressure.
2.1. If A is under backpressure, fix the backpressure first.
2.2. If A is not under backpressure, look through the TM log of A for anything abnormal (you may need to enable debug logging for this step).
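If clicking through the UI for step 1 is tedious, the same information should also be available from the REST API. Below is a rough sketch, not a tested tool: it assumes you have already fetched the per-subtask checkpoint details as JSON, and that each entry under "subtasks" carries an "index" and a "status" that is "completed" once the subtask has acked. The function name is made up for illustration and the field names may differ in your Flink version, so please check them against the REST docs.

```python
from typing import Any, Dict, List


def unacked_subtasks(details: Dict[str, Any]) -> List[int]:
    """Return the indices of subtasks that have not acked the checkpoint.

    `details` is the parsed JSON of the per-subtask checkpoint details.
    Assumption: every entry in details["subtasks"] has an "index" and a
    "status", and the status is "completed" only after the ack arrived.
    """
    return [
        subtask["index"]
        for subtask in details.get("subtasks", [])
        if subtask.get("status") != "completed"
    ]


# Hand-written sample payload (not real Flink output), for illustration only.
sample = {
    "subtasks": [
        {"index": 0, "status": "completed"},
        {"index": 1, "status": "pending_or_failed"},
        {"index": 2, "status": "completed"},
    ]
}
print(unacked_subtasks(sample))  # prints [1]
```

The subtask(s) it returns are the candidates for "A" in the steps above.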
On the TM side, a snapshot consists of three steps: 1) barrier alignment (exactly-once mode only; at-least-once mode does not need to align barriers); 2) the synchronous part; 3) the asynchronous part.
Backpressure affects step 1; too many timers, high CPU consumption, or high disk utilization may affect step 2; disk or network performance may affect step 3.
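To make the mapping between steps and likely causes concrete, here is a small sketch that takes the three durations of a slow subtask (for example, as read off the checkpoint UI) and names the dominant step together with the suspects listed above. The function name, the dictionary keys, and the numbers in the example are all made up for illustration; this is only a way of organizing the diagnosis, not something Flink provides.

```python
from typing import Dict, List, Tuple

# Suspects per snapshot step, taken from the description above.
PHASE_SUSPECTS: Dict[str, List[str]] = {
    "alignment": ["backpressure"],
    "sync": ["too many timers", "high CPU consumption", "high disk utilization"],
    "async": ["disk performance", "network performance"],
}


def dominant_phase(durations_ms: Dict[str, int]) -> Tuple[str, List[str]]:
    """Return the step with the longest duration and its usual suspects.

    `durations_ms` maps "alignment", "sync" and "async" to milliseconds.
    """
    phase = max(durations_ms, key=durations_ms.get)
    return phase, PHASE_SUSPECTS[phase]


# Hypothetical numbers: alignment dominates, so look at backpressure first.
phase, suspects = dominant_phase({"alignment": 52000, "sync": 300, "async": 4000})
print(phase, suspects)  # prints: alignment ['backpressure']
```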