(DEPRECATED) Apache Flink User Mailing List archive.

checkpointing seems to be throttled.

Posted by Colletta, Edward on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/checkpointing-seems-to-be-throttled-tp40240.html

We are running a session cluster with three TaskManagers, and cluster.evenly-spread-out-slots is set to true. There are 13 jobs running, with an average parallelism of 4 per job.

Flink version 1.11.2, Java 11.

Running on AWS EC2 instances with EFS for high-availability.storageDir.

We are seeing very high checkpoint times and experiencing timeouts. The checkpoint timeout is the default of 10 minutes. This does not seem to be related to EFS limits or throttling. We started experiencing these timeouts after upgrading from Flink 1.9.2/Java 8. Are there any known issues that cause very high checkpoint times?

Also, I noticed that we did not set state.checkpoints.dir. I assume checkpoints are being written to high-availability.storageDir instead. Is that correct?

For now we plan on setting

execution.checkpointing.timeout: 60 min

execution.checkpointing.tolerable-failed-checkpoints: 12

execution.checkpointing.unaligned: true

and also explicitly set

state.checkpoints.dir
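
Put together, the planned settings might look roughly like the following in flink-conf.yaml. This is only a sketch: the checkpoint directory is a hypothetical placeholder, not a path from our setup, and I am assuming these execution.checkpointing.* keys are picked up from the cluster configuration in 1.11.2 rather than needing to be set in each job.

```yaml
# Sketch of the planned flink-conf.yaml changes (values as listed above).

# Raise the checkpoint timeout from the default 10 minutes.
execution.checkpointing.timeout: 60 min

# Allow up to 12 failed checkpoints before the job itself is failed.
execution.checkpointing.tolerable-failed-checkpoints: 12

# Enable unaligned checkpoints (available since Flink 1.11).
execution.checkpointing.unaligned: true

# Explicit checkpoint directory. HYPOTHETICAL placeholder path;
# replace with the real EFS mount or other durable storage location.
state.checkpoints.dir: file:///path/to/checkpoints
```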