Debugging slow/failing checkpoints


Debugging slow/failing checkpoints

Steven Nelson

I am working with an application that hasn't gone to production yet. We run Flink as a cluster within a K8s environment. It has the following attributes:

1) 2 Job Managers configured for HA, backed by ZooKeeper and HDFS
2) 4 Task Managers 
3) Configured to use RocksDB. The actual RocksDB files are configured to be written to a locally attached NVMe drive.
4) We checkpoint every 15 seconds, with a minimum delay of 7.5 seconds between checkpoints (see the config sketch after this list).
5) There is currently very little load going through the system (it's in a test environment), and the web console does not indicate any back pressure.
6) The cluster is running Flink 1.9.0
7) I don't see anything unexpected in the logs
8) Checkpoints take longer than 10 minutes despite very little state (<1 MB); they fail due to timeout.
9) Eventually the job fails because it can't checkpoint.
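
For reference, here is a minimal sketch (Flink 1.9 Java API) of how a setup like the one above might be wired up. The class name, checkpoint URI, and local NVMe path are placeholders for illustration, not the actual values from this cluster.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 15 seconds (item 4 above).
        env.enableCheckpointing(15_000);

        CheckpointConfig cfg = env.getCheckpointConfig();
        // Minimum delay of 7.5 seconds between the end of one checkpoint and the start of the next.
        cfg.setMinPauseBetweenCheckpoints(7_500);
        // 10-minute timeout; checkpoints that run longer are declared failed (item 8).
        cfg.setCheckpointTimeout(600_000);

        // RocksDB state backend (item 3); checkpoint URI and local path are hypothetical.
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints", false);
        backend.setDbStoragePath("/mnt/nvme/rocksdb");
        env.setStateBackend(backend);

        // ... sources, operators, and sinks omitted ...
        env.execute("checkpoint-debug-sketch");
    }
}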

What steps beyond what I have already done should I consider to debug this?

-Steve




Re: Debugging slow/failing checkpoints

Congxian Qiu
Hi Steve,

1. Do you use exactly-once or at-least-once checkpointing?
2. Do you use incremental checkpoints or not?
3. Do you have any timers, and where are they stored (heap or RocksDB)? You can refer to the config here [1]; you can try storing the timers in RocksDB (see the sketch of points 2 and 3 after this list).
4. Is the barrier alignment time too long?
5. You can check whether it is the sync duration or the async duration that is taking too long.
6. Check whether IO/network during the checkpoint has reached its limit.
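
For points 2 and 3, a minimal sketch (again Flink 1.9 Java API, with a placeholder checkpoint URI and class name) of enabling incremental checkpoints and moving timer state into RocksDB:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalRocksDbTimersSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Point 2: incremental checkpoints -- only changed RocksDB files are uploaded
        // (second constructor argument = true).
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true);

        // Point 3: keep timer state in RocksDB instead of on the heap.
        // The flink-conf.yaml equivalent is: state.backend.rocksdb.timer-service.factory: ROCKSDB
        backend.setPriorityQueueStateType(RocksDBStateBackend.PriorityQueueStateType.ROCKSDB);

        env.setStateBackend(backend);
    }
}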

