High Job BackPressure

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

High Job BackPressure

Sayat Satybaldiyev-2
Dear Flink community,

Would anyone give a clue how to debug a job that has a high backpressure in the kafka source? We have a flink job that joins two stream via Process Function and Rocksdb state backend from two kafka topics. The job is significantly lagging behind ~8 hours and produces an incorrect result. 

Flink UI gives a hint that Source Functions(recommendation stream and custom source) are backpressure while recommendation-click join is fine. 

I've looked into JM and TM logs and there's nothing stage to me. Except "Kafka error sending fetch request" which happens during a checkpoint. Checkpoints happen once in 20min and utilize almost all network interface.

Please find UI screenshots and flink logs attached to this email.



task_metrics.png (119K) Download Attachment
watermarks.png (64K) Download Attachment
backpressure-source2-kafka.png (41K) Download Attachment
checkpoint_history.png (76K) Download Attachment
back_pressure_reco-stream.png (32K) Download Attachment
backpressure-clik-join.png (29K) Download Attachment
overall-DAG.png (42K) Download Attachment
NET traffic.png (52K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: High Job BackPressure

Sayat Satybaldiyev-2
I forgot to mention that the job was recently moved from the cluster with SSD disk to SATA and SSD disk. On the old hardware, everything worked fine. Flink version is 1.6.2. There were FLASH optimized setting for RocksDB. I've changed to SPINNING_DISK_OPTIMIZED and it didn't have any effect.

Old servers:

New Server:
https://www.hetzner.de/dedicated-rootserver/ax60-ssd

On Mon, Dec 3, 2018 at 8:07 PM Sayat Satybaldiyev <[hidden email]> wrote:
Dear Flink community,

Would anyone give a clue how to debug a job that has a high backpressure in the kafka source? We have a flink job that joins two stream via Process Function and Rocksdb state backend from two kafka topics. The job is significantly lagging behind ~8 hours and produces an incorrect result. 

Flink UI gives a hint that Source Functions(recommendation stream and custom source) are backpressure while recommendation-click join is fine. 

I've looked into JM and TM logs and there's nothing stage to me. Except "Kafka error sending fetch request" which happens during a checkpoint. Checkpoints happen once in 20min and utilize almost all network interface.

Please find UI screenshots and flink logs attached to this email.