Checkpoints very slow with high backpressure

Yassin Marzouki
Hi all,

I have a streaming pipeline as follows:
1 - read a folder continuousely from HDFS
2 - filter duplicates (using keyby(x->x) and keeping a state per key indicating whether its is seen)
3 - schedule some future actions on the stream using ProcessFunction and processing time timers (elements are kept in a MapState)
4- write results back to HDFS using a BucketingSink.
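
For reference, here is a simplified sketch of the pipeline. The class names (DedupFilter, DelayedEmitFunction), the paths, and the 1-minute delay are placeholders for illustration, not the exact production code:

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.util.Collector;

public class PipelineSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        env.enableCheckpointing(60_000);

        String input = "hdfs:///data/input";

        DataStream<String> lines = env.readFile(
                new TextInputFormat(new Path(input)), input,
                FileProcessingMode.PROCESS_CONTINUOUSLY, 1000L);      // (1) continuous HDFS source

        DataStream<String> deduped = lines
                .keyBy(x -> x)
                .filter(new DedupFilter());                            // (2) drop already-seen records

        DataStream<String> delayed = deduped
                .keyBy(x -> x)
                .process(new DelayedEmitFunction());                   // (3) schedule future emission

        delayed.addSink(new BucketingSink<String>("hdfs:///data/output")); // (4) write back to HDFS

        env.execute("dedup-and-delay");
    }

    /** Keeps a per-key flag marking whether the record has been seen before. */
    private static class DedupFilter extends RichFilterFunction<String> {
        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("seen", Boolean.class));
        }

        @Override
        public boolean filter(String value) throws Exception {
            if (seen.value() != null) {
                return false;
            }
            seen.update(true);
            return true;
        }
    }

    /** Buffers each record in MapState and emits it when its processing-time timer fires. */
    private static class DelayedEmitFunction extends ProcessFunction<String, String> {
        private transient MapState<Long, String> pending;

        @Override
        public void open(Configuration parameters) {
            pending = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("pending", Long.class, String.class));
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            long emitAt = ctx.timerService().currentProcessingTime() + 60_000; // emit one minute later
            pending.put(emitAt, value);
            ctx.timerService().registerProcessingTimeTimer(emitAt);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            String value = pending.get(timestamp);
            if (value != null) {
                out.collect(value);
                pending.remove(timestamp);
            }
        }
    }
}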

I am using the RocksDBStateBackend and Flink 1.3-SNAPSHOT (Commit: 9fb074c).

Currently the source contains just one file of 1 GB, so that is the maximum state the job might hold. I noticed that the backpressure on operators #1 and #2 is high, and the split reader has read only 60 MB out of the 1 GB source file. I suspect this is because the ProcessFunction is slow (on purpose). However, it looks like this also affected the checkpoints, which are all failing after the timeout (which I set to 2 hours).
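
For completeness, the checkpointing configuration looks roughly like this (the 10-minute checkpoint interval is illustrative; only the 2-hour timeout is the value actually mentioned above):

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
    env.enableCheckpointing(10 * 60 * 1000);                            // trigger a checkpoint every 10 minutes
    env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 60 * 1000); // declare the attempt failed after 2 hours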

In the job manager logs I keep getting warnings like:

2017-04-23 19:32:38,827 WARN org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 8 from 210769a077c67841d980776d8caece0a of job 6c7e44d205d738fc8a6cb4da181d2d86.

Is the high backpressure the cause of the checkpoints being so slow? If so, is there a way to disable the backpressure mechanism, since the records will be buffered in the RocksDB state anyway, which is backed by disk?

Any help is appreciated. Thank you.

Best,
Yassine