http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Stream-Task-seems-to-be-blocked-after-checkpoint-timeout-tp15861p15890.html
Hi Stefan,
I found something strange in the JM's log.
Checkpoint timeouts had happened more than once before, but in those cases all subtasks finished their checkpoint attempts in the end.
2017-09-26 01:23:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1140 @ 1506389008690
2017-09-26 01:28:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1141 @ 1506389308690
2017-09-26 01:33:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1142 @ 1506389608690
2017-09-26 01:33:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1140 expired before completing.
2017-09-26 01:38:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1141 expired before completing.
2017-09-26 01:40:38,044 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1140 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
2017-09-26 01:40:53,743 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1141 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
2017-09-26 01:41:19,332 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 1142 (136733704 bytes in 457413 ms).
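To make the timing in this excerpt easier to read, here is a minimal Python sketch that pairs each "Triggering checkpoint" line with its "expired before completing" line and reports the gap in seconds. The function name `expiry_delays` and the regular expressions are only illustrative and assume the log line format shown above. For #1140 and #1141 the gap comes out at roughly 600 seconds, which is consistent with a 10-minute checkpoint timeout on top of the 5-minute trigger interval.

```python
import re
from datetime import datetime

# Four lines copied from the JM log excerpt above.
JM_LOG = """\
2017-09-26 01:23:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1140 @ 1506389008690
2017-09-26 01:28:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1141 @ 1506389308690
2017-09-26 01:33:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1140 expired before completing.
2017-09-26 01:38:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1141 expired before completing.
"""

# Assumed timestamp layout: date, time, comma, milliseconds.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TRIGGERED = re.compile(r"^(\S+ \S+) .*Triggering checkpoint (\d+) @")
EXPIRED = re.compile(r"^(\S+ \S+) .*Checkpoint (\d+) expired before completing")


def expiry_delays(log_text):
    """Map checkpoint id -> seconds between its trigger line and its expiry line."""
    triggered = {}
    delays = {}
    for line in log_text.splitlines():
        m = TRIGGERED.match(line)
        if m:
            triggered[int(m.group(2))] = datetime.strptime(m.group(1), TS_FORMAT)
            continue
        m = EXPIRED.match(line)
        if m and int(m.group(2)) in triggered:
            chk = int(m.group(2))
            expired_at = datetime.strptime(m.group(1), TS_FORMAT)
            delays[chk] = (expired_at - triggered[chk]).total_seconds()
    return delays


for chk, seconds in sorted(expiry_delays(JM_LOG).items()):
    print(f"checkpoint {chk} expired {seconds:.3f} s after it was triggered")
```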
For chk #1245 and #1246, there was no late message from the TM. You can refer to the TM log. A fully completed checkpoint attempt generally produces 12 (... asynchronous part) log entries, but #1245 and #1246 only produced 10.
2017-09-26 10:08:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1245 @ 1506420508690
2017-09-26 10:13:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1246 @ 1506420808690
2017-09-26 10:18:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1245 expired before completing.
2017-09-26 10:23:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1246 expired before completing.
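The difference between the two excerpts can be checked mechanically. The sketch below is only an illustration (the helper name `expired_without_late_message` is made up, and the patterns assume the message wording shown in the JM log above): it collects expired checkpoint ids and subtracts those for which a late message was later received. Note that a late message only shows that at least one subtask reported after the expiry.

```python
import re

# Expiry and late-message lines copied from the JM log excerpts above
# (the truncated job id is left as it appears in the log paste).
JM_LOG = """\
2017-09-26 01:33:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1140 expired before completing.
2017-09-26 01:38:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1141 expired before completing.
2017-09-26 01:40:38,044 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1140 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
2017-09-26 01:40:53,743 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 1141 from c63825d15de0fef55a1d148adcf4467e of job 7c039572b...
2017-09-26 10:18:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1245 expired before completing.
2017-09-26 10:23:28,691 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 1246 expired before completing.
"""

EXPIRED = re.compile(r"Checkpoint (\d+) expired before completing")
LATE = re.compile(r"Received late message for now expired checkpoint attempt (\d+)")


def expired_without_late_message(log_text):
    """Return expired checkpoint ids that never got a late message in this log."""
    expired = {int(n) for n in EXPIRED.findall(log_text)}
    late = {int(n) for n in LATE.findall(log_text)}
    return sorted(expired - late)


print(expired_without_late_message(JM_LOG))
```

On the lines above this flags only 1245 and 1246, the two attempts that never reported back.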
Moreover, I listed the checkpoint directory on S3 and saw that two states had not been discarded successfully. In general, a completed checkpoint state has 16 parts.
2017-09-26 18:08:33 36919 tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1245/eedd7ca5-ee34-45a5-bf0b-11cc1fc67ab8
2017-09-26 18:13:34 37419 tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1246/9aa5c6c4-8c74-465d-8509-5fea4ed25af6
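A quick way to spot such leftovers is to count the objects under each chk-N directory and compare against the 16 parts mentioned above. This is a rough sketch that assumes the listing has the four-column shape shown here (date, time, size, key); the names `parts_per_checkpoint` and `incomplete_checkpoints` are hypothetical helpers, not part of any tool.

```python
import re
from collections import Counter

# The two listing lines from above.
S3_LISTING = """\
2017-09-26 18:08:33 36919 tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1245/eedd7ca5-ee34-45a5-bf0b-11cc1fc67ab8
2017-09-26 18:13:34 37419 tony-dev/flink-checkpoints/7c039572b13346f1b17dcc0ace2b72c2/chk-1246/9aa5c6c4-8c74-465d-8509-5fea4ed25af6
"""

CHK_DIR = re.compile(r"/chk-(\d+)/")


def parts_per_checkpoint(listing):
    """Count listed objects per chk-N directory."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split(None, 3)  # date, time, size, key
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        m = CHK_DIR.search(fields[3])
        if m:
            counts[int(m.group(1))] += 1
    return dict(counts)


def incomplete_checkpoints(listing, expected_parts=16):
    """Return {checkpoint id: part count} for directories with fewer parts than expected."""
    return {
        chk: n
        for chk, n in parts_per_checkpoint(listing).items()
        if n < expected_parts
    }


print(incomplete_checkpoints(S3_LISTING))
```

For the listing above, both chk-1245 and chk-1246 show up with a single part each.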
I hope this information is helpful. Thank you.
Best Regards,
Tony Wei