Web UI shows my AssignTimestamp is in high back pressure but in/outPoolUsage are both 0.

HaochengWang
Hi, I have a job like 'Source -> assignmentTimestamp -> flatmap -> Window -> Sink', and the 'BackPressure' tab in the Web UI shows back pressure from the 'Source' through the 'FlatMap' operator.
To find which operator is the source of the back pressure, I use metrics provided by the Web UI, specifically 'inPoolUsage' and 'outPoolUsage'.
First, as far as I know, when both of these metrics are 0, the operator should not be considered back pressured. But when I check the 'AssignmentTimestamp' operator, which runs 8 subtasks, 1 or 2 of them have a back pressure ratio of 0 and the others have a ratio higher than 0.80, yet all of them are marked as 'HIGH'. However, the two metrics, 'inPoolUsage' and 'outPoolUsage', are always 0. So maybe the operator is not actually back pressured? Or is there a problem with my Flink Web UI?
Second, from my experience, I think the source of the back pressure should be the 'Window' operator, because the outPoolUsage of 'FlatMap' is 1 and 'Window' is the first operator downstream of 'FlatMap'. But the inPoolUsage and outPoolUsage of 'Window' are also 0. So is the cause of the back pressure a network bottleneck between 'Window' and 'FlatMap'? Am I right?
Thanks for reading, and I'm looking forward to your ideas.

Haocheng

Re: Web UI shows my AssignTimestamp is in high back pressure but in/outPoolUsage are both 0.

Piotr Nowojski-4
Hi Haocheng,

Regarding the first part, yes. For a very long time there was a trivial bug that displayed the maximum "backpressure status" ("HIGH" in your case) across all of the subtasks for every subtask, instead of showing each subtask's individual status [1]. It is/will be fixed in Flink 1.11.4, 1.12.4, 1.13.1 and 1.14.0.
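
To make the effect of that display bug concrete, here is a minimal sketch in Python (not Flink code). It assumes the ratio thresholds given in the Flink documentation (OK up to 0.10, LOW up to 0.50, HIGH above that); the function names and the eight example ratios are made up for illustration, loosely matching the numbers in the question.

```python
# Illustrative sketch only; none of these names come from Flink itself.
# Assumed thresholds: OK <= 0.10 < LOW <= 0.50 < HIGH.
SEVERITY = {"OK": 0, "LOW": 1, "HIGH": 2}


def level(ratio: float) -> str:
    """Map a back pressure ratio in [0, 1] to a status label."""
    if ratio <= 0.10:
        return "OK"
    if ratio <= 0.50:
        return "LOW"
    return "HIGH"


def buggy_display(ratios: list[float]) -> list[str]:
    """Show the worst status of any subtask next to every subtask."""
    worst = max((level(r) for r in ratios), key=SEVERITY.__getitem__)
    return [worst for _ in ratios]


def per_subtask_display(ratios: list[float]) -> list[str]:
    """Show each subtask's own status."""
    return [level(r) for r in ratios]


# Hypothetical ratios for 8 subtasks: two idle, six above 0.80.
ratios = [0.0, 0.0, 0.85, 0.90, 0.82, 0.88, 0.95, 0.81]
print(buggy_display(ratios))
print(per_subtask_display(ratios))
```

With these inputs the first function labels all eight subtasks the same way, while the second one separates the two idle subtasks from the six back pressured ones.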

Also please note that, starting from 1.13.0, Flink has much better, more user-friendly tools for analysing the source of the backpressure [2]. I would highly recommend upgrading to it.

About the empty `inPoolUsage`: keep in mind that this metric ignores local channels [3], which might be hiding the problem. But yes, in principle, if the upstream subtask has full output buffers while the downstream subtasks have empty input buffers, that most likely means there is a problem in the network exchange. It can be network IO related, the network threads might be overloaded (CPU), or there may be some other issue (GC, encryption/SSL, compression). But that should only happen in very high throughput jobs, with hundreds of MB/s of network traffic. I would first make sure that your `Window` operator is not causing the backpressure. You could do that by upgrading to Flink 1.13.x and checking the newly added `busyTimeMsPerSecond` metric. Alternatively, you can attach a CPU profiler to a TaskManager. This is the most reliable way.
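
The buffer-usage reasoning above can be written down as a rough rule of thumb. The sketch below is plain Python, not a Flink API: the function name, the `high`/`low` cut-offs and the wording of the verdicts are assumptions chosen for illustration, and the inputs are the `outPoolUsage` of the upstream operator and the `inPoolUsage` of the downstream one, read off the Web UI by hand.

```python
# Rough heuristic only; the 0.9 / 0.1 cut-offs are arbitrary assumptions.
def diagnose_exchange(upstream_out: float, downstream_in: float,
                      high: float = 0.9, low: float = 0.1) -> str:
    """Interpret buffer usage on one upstream -> downstream exchange."""
    if upstream_out >= high and downstream_in <= low:
        # Full output buffers but empty input buffers: data is not getting
        # across. Note that inPoolUsage ignores local channels, so a busy
        # downstream operator can still be hiding behind a 0 here.
        return "suspect network exchange, or local channels hiding input usage"
    if upstream_out >= high and downstream_in >= high:
        # Both sides are full: the downstream operator is slow itself or is
        # being back pressured from further down the pipeline.
        return "downstream operator is slow or itself back pressured"
    if upstream_out <= low:
        return "no back pressure on this exchange"
    return "inconclusive"


# Values from the question: FlatMap outPoolUsage = 1, Window inPoolUsage = 0.
print(diagnose_exchange(1.0, 0.0))
```

A verdict from this kind of rule is only a hint; checking whether the downstream operator is actually busy (via `busyTimeMsPerSecond` or a CPU profiler) is still needed before blaming the network.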

Piotrek


On Sat, 12 Jun 2021 at 12:53, Haocheng Wang <[hidden email]> wrote: