------------------ Original Mail ------------------
Sender: Colletta, Edward <[hidden email]>
Send Date: Mon Dec 21 17:50:15 2020
Recipients: [hidden email] <[hidden email]>
Subject: checkpointing seems to be throttled.

Using session cluster with three taskmanagers, cluster.evenly-spread-out-slots is set to true. 13 jobs running. Average parallelism of each job is 4.
Flink version 1.11.2, Java 11.
Running on AWS EC2 instances with EFS for high-availability.storageDir.
We are seeing very high checkpoint times and experiencing timeouts. The checkpoint timeout is the default 10 minutes. This does not seem to be related to EFS limits/throttling. We started experiencing these timeouts after upgrading from Flink 1.9.2/Java 8. Are there any known issues which cause very high checkpoint times?
Also, I noticed we did not set state.checkpoints.dir; I assume it is using high-availability.storageDir. Is that correct?
For now we plan on setting
execution.checkpointing.timeout: 60 min
execution.checkpointing.tolerable-failed-checkpoints: 12
execution.checkpointing.unaligned: true
and also explicitly set
state.checkpoints.dir
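
As a rough sketch, the settings above could be written in flink-conf.yaml as follows. The checkpoint directory is a placeholder path assumed for illustration only, not a value from our setup.

```yaml
# Raise the checkpoint timeout from the default 10 minutes.
execution.checkpointing.timeout: 60 min
# Allow up to 12 failed checkpoints before the job itself fails.
execution.checkpointing.tolerable-failed-checkpoints: 12
# Enable unaligned checkpoints (available as of Flink 1.11).
execution.checkpointing.unaligned: true
# Explicit checkpoint directory. ASSUMPTION: hypothetical placeholder path.
state.checkpoints.dir: file:///path/to/checkpoints
```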