Hi community,
I'm glad that Flink 1.8.0 introduced cleanupInRocksdbCompactFilter to support state cleanup for the RocksDB backend.
We have an application that relies heavily on managed keyed store.
As we are using RocksDB as the state backend, we were suffering from an ever-growing state size. To be more specific, our checkpoint size grew to 200GB within 2 weeks.
After upgrading to 1.8.0 and using the cleanupInRocksdbCompactFilter TTL config, the checkpoint size never grows over 10GB.
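For anyone unfamiliar with the option, enabling it looks roughly like the configuration fragment below. This is only a minimal sketch assuming the Flink 1.8 API; the 7-day TTL, the state name "lastSeen", and the class name TtlConfigSketch are hypothetical placeholders, not our actual settings.

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Hypothetical helper class, for illustration only.
public class TtlConfigSketch {

    // Builds a state descriptor whose entries expire after the TTL and are
    // dropped by the RocksDB compaction filter during compaction.
    static ValueStateDescriptor<Long> lastSeenDescriptor() {
        StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.days(7))          // assumed TTL, not our real value
            .cleanupInRocksdbCompactFilter()   // no-arg variant as in Flink 1.8
            .build();

        ValueStateDescriptor<Long> descriptor =
            new ValueStateDescriptor<>("lastSeen", Long.class); // assumed state name
        descriptor.enableTimeToLive(ttlConfig);
        return descriptor;
    }
}
```

In 1.8 the compaction filter also has to be activated for the backend, for example by setting state.backend.rocksdb.ttl.compaction.filter.enabled: true in flink-conf.yaml.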
However, two days after the upgrade, checkpointing started to fail with "Checkpoint expired before completing".
From the log, I could not get anything useful.
But in the Flink UI, the last successful checkpoint took 1m to finish, and our checkpoint timeout is set to 15m.
It seems that the checkpoint duration became extremely long all of a sudden.
Is there any way I can look into this further? Or is there any direction in which I can tune the TTL for the application?
Thanks in advance!
Best regards,
Mu