Debugging slow/failing checkpoints


Debugging slow/failing checkpoints

Steven Nelson

I am working with an application that hasn't gone to production yet. We run Flink as a cluster within a K8s environment. It has the following attributes:

1) 2 Job Managers configured for HA, backed by ZooKeeper and HDFS
2) 4 Task Managers 
3) Configured to use RocksDB. The actual RocksDB files are configured to be written to a locally attached NVMe drive.
4) We checkpoint every 15 seconds, with a minimum delay of 7.5 seconds between checkpoints (see the config sketch after this list).
5) There is currently very little load going through the system (it's in a test environment), and the web console does not indicate any back pressure.
6) The cluster is running Flink 1.9.0
7) I don't see anything unexpected in the logs
8) Checkpoints take longer than 10 minutes despite very little state (<1 MB); they fail due to timeout.
9) Eventually the job fails because it can't checkpoint.
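
For reference, here is a minimal sketch (Flink 1.9 Java API) of how a setup like the one above might be wired up. The class name, checkpoint URI, and local NVMe path are placeholders for illustration, not the actual values from this cluster.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 15 seconds (item 4 above).
        env.enableCheckpointing(15_000);

        CheckpointConfig cfg = env.getCheckpointConfig();
        // Minimum delay of 7.5 seconds between the end of one checkpoint and the start of the next.
        cfg.setMinPauseBetweenCheckpoints(7_500);
        // 10-minute timeout; checkpoints that run longer are declared failed (item 8).
        cfg.setCheckpointTimeout(600_000);

        // RocksDB state backend (item 3); checkpoint URI and local path are hypothetical.
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints", false);
        backend.setDbStoragePath("/mnt/nvme/rocksdb");
        env.setStateBackend(backend);

        // ... sources, operators, and sinks omitted ...
        env.execute("checkpoint-debug-sketch");
    }
}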

What steps beyond what I have already done should I consider to debug this?

-Steve




Re: Debugging slow/failing checkpoints

Congxian Qiu
Hi Steve,

1. Do you use exactly-once or at-least-once checkpointing?
2. Do you use incremental checkpoints or not?
3. Do you have any timers, and where are they stored (heap or RocksDB)? You can refer to the config here [1]; you can try storing the timers in RocksDB (see the sketch of points 2 and 3 after this list).
4. Is the barrier alignment time too long?
5. You can check whether it is the sync duration or the async duration that is taking too long.
6. Check whether IO/network during the checkpoint has reached its limit.
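
For points 2 and 3, a minimal sketch (again Flink 1.9 Java API, with a placeholder checkpoint URI and class name) of enabling incremental checkpoints and moving timer state into RocksDB:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalRocksDbTimersSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Point 2: incremental checkpoints -- only changed RocksDB files are uploaded
        // (second constructor argument = true).
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true);

        // Point 3: keep timer state in RocksDB instead of on the heap.
        // The flink-conf.yaml equivalent is: state.backend.rocksdb.timer-service.factory: ROCKSDB
        backend.setPriorityQueueStateType(RocksDBStateBackend.PriorityQueueStateType.ROCKSDB);

        env.setStateBackend(backend);
    }
}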

