(DEPRECATED) Apache Flink User Mailing List archive.

High Job BackPressure

Classic

List

Threaded

2 messages Options

Sayat Satybaldiyev-2

High Job BackPressure

Dear Flink community,

Would anyone give a clue how to debug a job that has a high backpressure in the kafka source? We have a flink job that joins two stream via Process Function and Rocksdb state backend from two kafka topics. The job is significantly lagging behind ~8 hours and produces an incorrect result.

Flink UI gives a hint that Source Functions(recommendation stream and custom source) are backpressure while recommendation-click join is fine.

I've looked into JM and TM logs and there's nothing stage to me. Except "Kafka error sending fetch request" which happens during a checkpoint. Checkpoints happen once in 20min and utilize almost all network interface.

Please find UI screenshots and flink logs attached to this email.

https://drive.google.com/file/d/14h8zwC_49wxt5uNPYtM3LN6WhJ7lyeVS/view?usp=sharing

https://drive.google.com/file/d/1s6I___S7u0pBJyWdnmYaH0e_MwGr3CgY/view?usp=sharing

task_metrics.png (119K) Download Attachment

watermarks.png (64K) Download Attachment

backpressure-source2-kafka.png (41K) Download Attachment

checkpoint_history.png (76K) Download Attachment

back_pressure_reco-stream.png (32K) Download Attachment

backpressure-clik-join.png (29K) Download Attachment

overall-DAG.png (42K) Download Attachment

NET traffic.png (52K) Download Attachment

Sayat Satybaldiyev-2

Re: High Job BackPressure

I forgot to mention that the job was recently moved from the cluster with SSD disk to SATA and SSD disk. On the old hardware, everything worked fine. Flink version is 1.6.2. There were FLASH optimized setting for RocksDB. I've changed to SPINNING_DISK_OPTIMIZED and it didn't have any effect.

Old servers:

https://www.hetzner.de/dedicated-rootserver/px91-ssd

New Server:

https://www.hetzner.de/dedicated-rootserver/ax60-ssd

On Mon, Dec 3, 2018 at 8:07 PM Sayat Satybaldiyev <[hidden email]> wrote:

Dear Flink community,

Would anyone give a clue how to debug a job that has a high backpressure in the kafka source? We have a flink job that joins two stream via Process Function and Rocksdb state backend from two kafka topics. The job is significantly lagging behind ~8 hours and produces an incorrect result.

Flink UI gives a hint that Source Functions(recommendation stream and custom source) are backpressure while recommendation-click join is fine.

I've looked into JM and TM logs and there's nothing stage to me. Except "Kafka error sending fetch request" which happens during a checkpoint. Checkpoints happen once in 20min and utilize almost all network interface.

Please find UI screenshots and flink logs attached to this email.

https://drive.google.com/file/d/14h8zwC_49wxt5uNPYtM3LN6WhJ7lyeVS/view?usp=sharing
https://drive.google.com/file/d/1s6I___S7u0pBJyWdnmYaH0e_MwGr3CgY/view?usp=sharing