Re: checkpoint stuck with rocksdb statebackend and s3 filesystem

Posted by Stefan Richter on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/checkpoint-stuck-with-rocksdb-statebackend-and-s3-filesystem-tp18679p18701.html

Hi,

thanks for all the info. I had a look into the problem and opened https://issues.apache.org/jira/browse/FLINK-8871 to fix this. From your stack trace, you can see that many checkpointing threads are still running on your TM for checkpoints that have already timed out, and I think this cascades and slows everything down. It seems that the implementation of features like checkpoint timeouts, and of not failing tasks on checkpointing problems, overlooked that we now also need to properly communicate the checkpoint cancellation to all tasks, which was not required before.
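To make the effect concrete, here is a minimal, self-contained sketch (plain java.util.concurrent, not Flink's internals; class name and timings are made up) of what happens when the asynchronous part of a timed-out snapshot is not actively cancelled:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class CheckpointCancellationSketch {

        public static void main(String[] args) throws Exception {
            ExecutorService asyncSnapshots = Executors.newSingleThreadExecutor();

            // Stand-in for the asynchronous part of a snapshot (e.g. a slow S3 upload).
            Future<?> snapshot = asyncSnapshots.submit(() -> {
                try {
                    Thread.sleep(60_000); // pretend the upload takes a minute
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    System.out.println("snapshot cancelled, resources can be released");
                }
            });

            try {
                // The "checkpoint timeout": give up waiting after 2 seconds.
                snapshot.get(2, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // Without this cancellation the snapshot thread keeps running and
                // competes with the threads of later checkpoints, which is the
                // cascading slowdown described above.
                snapshot.cancel(true);
            }

            asyncSnapshots.shutdown();
        }
    }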

Best,
Stefan

On 05.03.2018 at 14:42, Tony Wei <[hidden email]> wrote:

Hi Stefan,

Here is my checkpointing configuration.

Checkpointing Mode: Exactly Once
Interval: 20m 0s
Timeout: 10m 0s
Minimum Pause Between Checkpoints: 0ms
Maximum Concurrent Checkpoints: 1
Persist Checkpoints Externally: Enabled (delete on cancellation)
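For reference, these settings correspond to a checkpoint configuration along the following lines against the Flink 1.4 DataStream API (only the values listed above come from the thread; the class name is illustrative):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSettings {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Exactly-once checkpoints every 20 minutes.
            env.enableCheckpointing(20 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

            CheckpointConfig cfg = env.getCheckpointConfig();
            cfg.setCheckpointTimeout(10 * 60 * 1000L); // Timeout: 10m 0s
            cfg.setMinPauseBetweenCheckpoints(0L);     // Minimum Pause Between Checkpoints: 0ms
            cfg.setMaxConcurrentCheckpoints(1);        // Maximum Concurrent Checkpoints: 1
            cfg.enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

            // ... sources, operators and env.execute() would follow here.
        }
    }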
Best Regards,
Tony Wei

2018-03-05 21:30 GMT+08:00 Stefan Richter <[hidden email]>:
Hi,

quick question: what is your exact checkpointing configuration? In particular, what is your value for the maximum parallel checkpoints and the minimum time interval to wait between two checkpoints?

Best,
Stefan

> On 05.03.2018 at 06:34, Tony Wei <[hidden email]> wrote:
>
> Hi all,
>
> Last weekend, my Flink job's checkpoints started failing because of timeouts. I have no idea what happened, but I collected some information about my cluster and job. I hope someone can give me advice or hints about the problem I encountered.
>
> My cluster version is flink-release-1.4.0. The cluster has 10 TMs, each with 4 cores. These machines are EC2 spot instances. The job's parallelism is set to 32, using RocksDB as the state backend and S3 (Presto) as the checkpoint file system.
> The state size is nearly 15 GB and still grows day by day. Normally, it takes 1.5 minutes to finish the whole checkpoint process. The timeout configuration is set to 10 minutes.
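A setup like this is typically wired up along these lines in Flink 1.4 (the bucket path is a placeholder and the incremental-checkpoint flag is an assumption; the s3:// URI is served by flink-s3-fs-presto when that jar is on the classpath):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDbOnS3Setup {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(32);

            // RocksDB keeps the working state on the local disk of each TM; checkpoints
            // are written to the given URI, handled here by the Presto S3 filesystem.
            // The bucket path is a placeholder and the second argument (incremental
            // checkpoints) is an assumption, not stated in the original mail.
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true));
        }
    }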
>
> <chk_snapshot.png>
>
> As the picture shows, not every subtask's checkpoint failed due to timeout, but over the weekend every machine at some point had all of its subtasks fail. Some machines recovered by themselves and some recovered only after I restarted them.
>
> I recorded logs, stack traces, and snapshots of the machines' status (CPU, IO, network, etc.) for both a good and a bad machine. If you need more information, please let me know. Thanks in advance.
>
> Best Regards,
> Tony Wei.
> <bad_tm_log.log><bad_tm_pic.png><bad_tm_stack.log><good_tm_log.log><good_tm_pic.png><good_tm_stack.log>