We are using a session cluster with three TaskManagers, and cluster.evenly-spread-out-slots is set to true. There are 13 jobs running, with an average parallelism of 4 per job.
Flink version 1.11.2, Java 11.
Running on AWS EC2 instances with EFS for high-availability.storageDir.
We are seeing very high checkpoint times and experiencing timeouts. The checkpoint timeout is the default 10 minutes. This does not seem to be related to EFS limits/throttling. We started experiencing these timeouts after upgrading from Flink 1.9.2/Java 8. Are there any known issues which cause very high checkpoint times?
Also, I noticed we did not set state.checkpoints.dir. I assume it is using high-availability.storageDir; is that correct?
For now we plan on setting:
execution.checkpointing.timeout: 60 min
execution.checkpointing.tolerable-failed-checkpoints: 12
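As a rough sketch, this is how those settings might look in flink-conf.yaml. I am assuming the execution.checkpointing.* keys are picked up from the config file in 1.11.2. The state.checkpoints.dir line is commented out and its path is only a made-up placeholder, since we have not set that option so far.

```yaml
# Sketch only, not our verified production config.

# Raise the checkpoint timeout from the default 10 minutes.
execution.checkpointing.timeout: 60 min

# Tolerate up to 12 failed checkpoints before the job is failed.
execution.checkpointing.tolerable-failed-checkpoints: 12

# Currently NOT set in our cluster; hypothetical placeholder path.
# state.checkpoints.dir: file:///mnt/efs/flink/checkpoints
```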