Checkpoint expired before completing with cleanupInRocksdbCompactFilter



Mu Kong
Hi community,

I'm glad that Flink 1.8.0 introduced cleanupInRocksdbCompactFilter to support state cleanup for the RocksDB backend.
We have an application that relies heavily on managed keyed state.
Since we use RocksDB as the state backend, we were suffering from an ever-growing state size. To be more specific, our checkpoint size grew to 200GB in two weeks.

After upgrading to 1.8.0 and enabling the cleanupInRocksdbCompactFilter TTL config, the checkpoint size never grows over 10GB.
However, two days after the upgrade, checkpointing started to fail with "Checkpoint expired before completing".
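For context, the cleanup is enabled roughly like this (a sketch based on the Flink 1.8 state TTL API; the TTL value and state name are illustrative, not from our actual job, and in 1.8 the compaction filter also has to be switched on via state.backend.rocksdb.ttl.compaction.filter.enabled in flink-conf.yaml):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Sketch: TTL with RocksDB compaction-filter cleanup (Flink 1.8 API).
// The 7-day TTL and the state name "myState" are illustrative.
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(7))
        .cleanupInRocksdbCompactFilter()  // expired entries dropped during RocksDB compaction
        .build();

ValueStateDescriptor<Long> descriptor =
        new ValueStateDescriptor<>("myState", Long.class);
descriptor.enableTimeToLive(ttlConfig);
```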

From the logs, I could not find anything useful.
But in the Flink UI, the last successful checkpoint took 1 minute to finish, while our checkpoint timeout is set to 15 minutes.
It seems the checkpoint duration became extremely long all of a sudden.
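For reference, the timeout is configured roughly like this (a sketch of the Flink 1.8 checkpoint configuration API; the 1-minute interval is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: checkpoint interval and timeout (Flink 1.8 DataStream API).
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L);                                 // interval is illustrative
env.getCheckpointConfig().setCheckpointTimeout(15 * 60 * 1000L);  // 15-minute timeout, as in the thread
```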

Is there any way I can look into this further? Or is there any guidance on tuning the TTL for this application?

Thanks in advance!

Best regards,
Mu
Re: Checkpoint expired before completing with cleanupInRocksdbCompactFilter

Congxian Qiu
Hi, Mu
Is there anything that looks like `Received late message for now expired checkpoint attempt ${checkpointID} from ${taskExecutionID} of job ${jobID}` in the JM log?

If yes, that means this task took too long to complete the checkpoint (maybe it received the barrier too late, or maybe it spent too much time doing the checkpoint itself; you can investigate further in the TM log).
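To check, something like this should work (a sketch; the log path and the exact message text are assumptions to adjust for your deployment):

```shell
# Sketch: scan the JobManager log for late checkpoint acknowledgements.
# JM_LOG path and the grep pattern are assumptions; adjust to your setup.
JM_LOG="${JM_LOG:-jobmanager.log}"
grep -n "Received late message for now expired checkpoint" "$JM_LOG" || true
```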


Best
Congxian
On May 9, 2019, 14:44 +0800, Mu Kong <[hidden email]>, wrote: