Re: checkpoint stuck with rocksdb statebackend and s3 filesystem
Posted by Stefan Richter
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/checkpoint-stuck-with-rocksdb-statebackend-and-s3-filesystem-tp18679p18701.html
Hi,
thanks for all the info. I had a look into the problem and opened
https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack trace, you can see that many checkpointing threads are still running on your TM for checkpoints that have already timed out, and I think this cascades and slows everything down. It seems that the implementation of some features, such as checkpoint timeouts and not failing tasks on checkpointing problems, overlooked that we now also need to properly communicate the checkpoint cancellation to all tasks, which was not needed before.
Best,
Stefan
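
The cascade described above can be reproduced in miniature with plain Java threads. The following is only an illustrative sketch, not Flink code and not the FLINK-8871 fix: the class name, the method `peakConcurrentSnapshots`, and the timing constants are all hypothetical. A "coordinator" loop triggers slow snapshot jobs and gives up on each one after a timeout. If it only gives up, the timed-out jobs keep running and pile up; if it also cancels the job, at most one snapshot thread is active at a time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a task manager running asynchronous snapshot threads.
class CheckpointCancellationSketch {
    static final long SNAPSHOT_MS = 5_000; // assumed duration of a slow snapshot upload
    static final long TIMEOUT_MS = 50;     // assumed checkpoint timeout on the coordinator

    /** Returns the peak number of snapshot threads that were active at the same time. */
    static int peakConcurrentSnapshots(int checkpoints, boolean cancelOnTimeout) {
        ExecutorService taskManager = Executors.newCachedThreadPool();
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            for (int i = 0; i < checkpoints; i++) {
                CountDownLatch started = new CountDownLatch(1);
                CountDownLatch finished = new CountDownLatch(1);
                Future<?> snapshot = taskManager.submit(() -> {
                    // Record how many snapshot threads are active right now.
                    peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                    started.countDown();
                    try {
                        Thread.sleep(SNAPSHOT_MS); // stands in for the slow state upload
                    } catch (InterruptedException e) {
                        // The cancellation reached the task, so it stops working.
                    } finally {
                        running.decrementAndGet();
                        finished.countDown();
                    }
                });
                started.await(); // start the timeout clock once the snapshot is running
                try {
                    snapshot.get(TIMEOUT_MS, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // The coordinator declares the checkpoint expired.
                    if (cancelOnTimeout) {
                        snapshot.cancel(true); // communicate the cancellation to the task
                        finished.await();      // wait until the task has really stopped
                    }
                    // Otherwise the task keeps running while the next checkpoint starts.
                }
            }
            return peak.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            taskManager.shutdownNow(); // clean up any snapshot threads still sleeping
        }
    }

    public static void main(String[] args) {
        System.out.println("peak without cancellation: " + peakConcurrentSnapshots(4, false));
        System.out.println("peak with cancellation:    " + peakConcurrentSnapshots(4, true));
    }
}
```

With four checkpoints, the run without cancellation should report a peak of 4 overlapping snapshot threads and the run with cancellation a peak of 1. The point is that a timeout on the coordinator side does not stop the work on the task side unless the cancellation is explicitly delivered there.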
Hi Stefan,
Here is my checkpointing configuration.
Checkpointing Mode                 | Exactly Once
Interval                           | 20m 0s
Timeout                            | 10m 0s
Minimum Pause Between Checkpoints  | 0ms
Maximum Concurrent Checkpoints     | 1
Persist Checkpoints Externally     | Enabled (delete on cancellation)
Best Regards,
Tony Wei