Checkpoint expired before completing with cleanupInRocksdbCompactFilter



Mu Kong
Hi community,

I'm glad that Flink 1.8.0 introduced cleanupInRocksdbCompactFilter to support state cleanup for the RocksDB backend.
We have an application that relies heavily on managed keyed state.
Since we use RocksDB as the state backend, we were suffering from an ever-growing state size. To be more specific, our checkpoint size grew to 200GB in two weeks.

After upgrading to 1.8.0 and enabling the cleanupInRocksdbCompactFilter TTL config, the checkpoint size never grows over 10GB.
However, two days after the upgrade, checkpointing started to fail with "Checkpoint expired before completing".
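For context, the cleanup is enabled roughly like this (a sketch based on the Flink 1.8 state TTL API; the TTL value and state name are illustrative, not from our actual job, and in 1.8 the compaction filter also has to be switched on via state.backend.rocksdb.ttl.compaction.filter.enabled in flink-conf.yaml):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Sketch: TTL with RocksDB compaction-filter cleanup (Flink 1.8 API).
// The 7-day TTL and the state name "myState" are illustrative.
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(7))
        .cleanupInRocksdbCompactFilter()  // expired entries dropped during RocksDB compaction
        .build();

ValueStateDescriptor<Long> descriptor =
        new ValueStateDescriptor<>("myState", Long.class);
descriptor.enableTimeToLive(ttlConfig);
```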

From the logs, I could not find anything useful.
But in the Flink UI, the last successful checkpoint took 1 minute to finish, while our checkpoint timeout is set to 15 minutes.
It seems the checkpoint duration became extremely long all of a sudden.
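For reference, the timeout is configured roughly like this (a sketch of the Flink 1.8 checkpoint configuration API; the 1-minute interval is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: checkpoint interval and timeout (Flink 1.8 DataStream API).
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L);                                 // interval is illustrative
env.getCheckpointConfig().setCheckpointTimeout(15 * 60 * 1000L);  // 15-minute timeout, as in the thread
```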

Is there any way I can look into this further? Or is there any guidance on tuning the TTL for this application?

Thanks in advance!

Best regards,
Mu
Re: Checkpoint expired before completing with cleanupInRocksdbCompactFilter

Congxian Qiu
Hi, Mu
Is there anything that looks like `Received late message for now expired checkpoint attempt ${checkpointID} from ${taskExecutionID} of job ${jobID}` in the JM log?

If yes, that means this task took too long to complete the checkpoint (maybe it received the barrier too late, or maybe it spent too much time doing the checkpoint itself; you can investigate further in the TM log).
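To check, something like this should work (a sketch; the log path and the exact message text are assumptions to adjust for your deployment):

```shell
# Sketch: scan the JobManager log for late checkpoint acknowledgements.
# JM_LOG path and the grep pattern are assumptions; adjust to your setup.
JM_LOG="${JM_LOG:-jobmanager.log}"
grep -n "Received late message for now expired checkpoint" "$JM_LOG" || true
```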


Best
Congxian
On May 9, 2019, 14:44 +0800, Mu Kong <[hidden email]>, wrote: