checkpointing seems to be throttled.

Posted by Colletta, Edward on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/checkpointing-seems-to-be-throttled-tp40240.html

We are using a session cluster with three TaskManagers, and cluster.evenly-spread-out-slots is set to true. There are 13 jobs running, with an average parallelism of 4 per job.

Flink version 1.11.2, Java 11.

We are running on AWS EC2 instances, with EFS used for high-availability.storageDir.

We are seeing very high checkpoint times and are experiencing timeouts. The checkpoint timeout is the default of 10 minutes. This does not seem to be related to EFS limits or throttling. We started experiencing these timeouts after upgrading from Flink 1.9.2/Java 8. Are there any known issues that cause very high checkpoint times?

Also, I noticed that we did not set state.checkpoints.dir. I assume it is using high-availability.storageDir. Is that correct?

For now we plan on setting

execution.checkpointing.timeout: 60 min

execution.checkpointing.tolerable-failed-checkpoints: 12

execution.checkpointing.unaligned: true

and also explicitly setting state.checkpoints.dir.
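
For concreteness, here is a rough sketch of how these planned changes might look together in flink-conf.yaml. The checkpoint directory shown is a made-up placeholder, not our actual path, and we have not yet confirmed that these settings resolve the timeouts.

```yaml
# Sketch of the planned flink-conf.yaml changes (Flink 1.11.2).
# Raise the checkpoint timeout from the default 10 minutes.
execution.checkpointing.timeout: 60 min
# Allow up to 12 failed checkpoints before the job itself is failed.
execution.checkpointing.tolerable-failed-checkpoints: 12
# Enable unaligned checkpoints.
execution.checkpointing.unaligned: true
# Hypothetical placeholder path; replace with the real checkpoint location.
state.checkpoints.dir: file:///path/to/efs/flink-checkpoints
```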