High Job BackPressure

Posted by Sayat Satybaldiyev-2 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/High-Job-BackPressure-tp24903.html

Dear Flink community,

Could anyone suggest how to debug a job that shows high backpressure at the Kafka source? We have a Flink job that reads two Kafka topics and joins the two streams in a ProcessFunction, using the RocksDB state backend. The job is lagging significantly, by about 8 hours, and produces incorrect results.

The Flink UI indicates that the source functions (recommendation stream and custom source) are backpressured, while the recommendation-click join is reported as fine.
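For anyone who wants to track these readings over time rather than through screenshots, the backpressure status shown in the UI can also be read from the JobManager REST endpoint `/jobs/:jobid/vertices/:vertexid/backpressure`. Below is a minimal sketch, not something taken from our job: the address `http://localhost:8081` and the helper names are assumptions, and the JSON field names (`backpressure-level`, `ratio`) are what I expect from older Flink releases, with the newer camelCase spelling accepted as a fallback.

```python
import json
from urllib.request import urlopen

# Assumed JobManager REST address; adjust for the actual cluster.
FLINK_REST = "http://localhost:8081"


def backpressure_url(job_id, vertex_id, base=FLINK_REST):
    """Build the URL of the per-vertex backpressure endpoint."""
    return f"{base}/jobs/{job_id}/vertices/{vertex_id}/backpressure"


def summarize(payload):
    """Return (overall_level, subtasks) from a backpressure response.

    subtasks is a list of (subtask_index, level, ratio) tuples,
    sorted so the most backpressured subtask comes first.
    """
    def level_of(obj):
        # Older releases use "backpressure-level"; newer ones use camelCase.
        return obj.get("backpressure-level", obj.get("backpressureLevel"))

    subtasks = [
        (s.get("subtask"), level_of(s), float(s.get("ratio", 0.0)))
        for s in payload.get("subtasks", [])
    ]
    subtasks.sort(key=lambda t: t[2], reverse=True)
    return level_of(payload), subtasks


def fetch_backpressure(job_id, vertex_id, base=FLINK_REST):
    """Fetch and decode the backpressure response for one job vertex."""
    with urlopen(backpressure_url(job_id, vertex_id, base), timeout=10) as resp:
        return json.load(resp)
```

Calling `summarize(fetch_backpressure(job_id, vertex_id))` with the IDs listed under `/jobs` and `/jobs/:jobid` would give the overall level for that vertex plus a per-subtask breakdown. The first request may return a status indicating that sampling is still in progress, in which case it needs to be repeated.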

I've looked into the JM and TM logs and nothing looks strange to me, except for "Kafka error sending fetch request", which appears during checkpoints. Checkpoints run once every 20 minutes and use almost all of the network interface's bandwidth.

Please find UI screenshots and flink logs attached to this email.



task_metrics.png (119K)
watermarks.png (64K)
backpressure-source2-kafka.png (41K)
checkpoint_history.png (76K)
back_pressure_reco-stream.png (32K)
backpressure-clik-join.png (29K)
overall-DAG.png (42K)
NET traffic.png (52K)