Re: checkpoint stuck with rocksdb statebackend and s3 filesystem
Posted by gerryzhou
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/checkpoint-stuck-with-rocksdb-statebackend-and-s3-filesystem-tp18679p18704.html
Hi Tony,
Sorry to jump in. One thing I want to point out: from the log you provided, it looks like you are using "full checkpoint". This means that the state data that needs to be checkpointed and transferred to s3 will grow over time, and even for the first checkpoint its performance is slower than incremental checkpoint (because it needs to iterate over all the records from RocksDB using the RocksDBMergeIterator). Maybe you can try "incremental checkpoint"; it could give you better performance.
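To make the growth argument concrete, here is a rough back-of-the-envelope sketch in Python. It is only a simplified model, not Flink code: the function name `upload_sizes` and all the numbers are made up for illustration and are not taken from your job. It assumes a full checkpoint transfers the whole state every time, while an incremental checkpoint transfers only what was added since the previous one, and it ignores RocksDB compaction as well as the record-iteration cost mentioned above.

```python
def upload_sizes(initial_mb, growth_mb_per_checkpoint, num_checkpoints):
    """Return (full, incremental) lists of MB uploaded at each checkpoint.

    Simplified model: a full checkpoint uploads the entire state, while an
    incremental checkpoint uploads only the data added since the previous
    checkpoint. RocksDB compaction is ignored.
    """
    full, incremental = [], []
    state_mb = initial_mb
    previous_mb = 0  # nothing has been uploaded before the first checkpoint
    for _ in range(num_checkpoints):
        full.append(state_mb)
        incremental.append(state_mb - previous_mb)
        previous_mb = state_mb
        state_mb += growth_mb_per_checkpoint
    return full, incremental


# Hypothetical numbers: 1000 MB of state, growing 200 MB between checkpoints.
full, incremental = upload_sizes(
    initial_mb=1000, growth_mb_per_checkpoint=200, num_checkpoints=4
)
print(full)         # [1000, 1200, 1400, 1600]
print(incremental)  # [1000, 200, 200, 200]
```

In this model the full checkpoint upload keeps growing with the state, while the incremental upload stays proportional to the change between checkpoints.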
Best Regards,
Sihua Zhou
Hi Stefan,
I see. That explains why the load on the machines grew. However, I think it is not the root cause that led to these consecutive checkpoint timeouts. As I said in my first mail, checkpointing usually took 1.5 mins to upload the states, and this operator and the kafka consumer are the only two operators that have state in my pipeline. In the best case, I should never encounter a timeout problem that is caused only by lots of pending checkpointing threads that have already timed out. Am I right?
Since this logging and stack trace were taken nearly 3 hours after the first checkpoint timeout, I'm afraid that we can't actually find the root cause of the first checkpoint timeout from them. Because we are preparing to take this pipeline to production, I was wondering if you could help me find out where the root cause lies: bad machines, s3, the flink-s3-presto package, or the flink checkpointing thread. It would be great if we could find it from the information I provided; a hypothesis based on your experience is welcome as well. The most important thing is that I have to decide whether I need to change my persistence filesystem or use another s3 filesystem package, because the last thing I want to see is checkpoint timeouts happening very often.
Thank you very much for all your advice.
Best Regards,
Tony Wei