I am working with an application that hasn't gone to production yet. We run Flink as a cluster within a K8s environment. It has the following attributes:
1) 2 Job Managers configured for HA, backed by ZooKeeper and HDFS
2) 4 Task Managers
3) Configured to use RocksDB. The actual RocksDB files are configured to be written to a locally attached NVMe drive.
4) We checkpoint every 15 seconds, with a minimum pause of 7.5 seconds between checkpoints.
5) There is currently very little load going through the system (it's in a test environment). The web console shows no back pressure.
6) The cluster is running Flink 1.9.0
7) I don't see anything unexpected in the logs
8) Checkpoints take longer than 10 minutes even though there is very little state (< 1 MB), and they fail due to timeout.
9) Eventually the job fails because it can't checkpoint.
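
For reference, the checkpointing and state backend settings in items 3 and 4 correspond roughly to the Java sketch below. This is only an illustration of the setup, not the actual job: the HDFS checkpoint URI, the NVMe mount path, the class name and the placeholder pipeline are all assumed, and the 10 minute timeout is assumed to be Flink's default rather than something set explicitly. The same values could equally be set in flink-conf.yaml.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Item 4: trigger a checkpoint every 15 s, with at least 7.5 s between
        // the end of one checkpoint and the start of the next.
        env.enableCheckpointing(15_000L);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(7_500L);

        // Item 8: checkpoints are declared failed after 10 minutes
        // (assumed here to be the default timeout, written out for clarity).
        env.getCheckpointConfig().setCheckpointTimeout(600_000L);

        // Item 3: RocksDB state backend. The checkpoint URI is hypothetical.
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
        // Working RocksDB files on the locally attached NVMe drive (hypothetical path).
        backend.setDbStoragePath("/mnt/nvme/flink/rocksdb");
        env.setStateBackend((StateBackend) backend);

        // Placeholder pipeline so the sketch is a complete job.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpoint-setup-sketch");
    }
}
```

Running this needs the flink-streaming-java and flink-statebackend-rocksdb artifacts for 1.9.0 on the classpath.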
What steps beyond what I have already done should I consider to debug this?
-Steve