First off, thanks for your reply!
I have an assumption that I should probably verify first:
When determining the source of the backpressure we look (in the WebUI) for the first operator in our pipeline that is not showing backpressure. That’s what we consider to be the source of the backpressure
In this case the first operator that in our graph that is not showing backpressure is our window operator (all though the keyBy operation right before it doesn’t show up in the graph). The window function uses a custom aggregation function that builds up a hashmap and a custom process function that emits the hashmap and performs some metrics operations. I am not sure how this would generate backpressure since it doesn’t perform any IO, but again I might be drawing incorrect conclusions.
The window function has a parallelism of 32. Each of the Subtasks has between 136kb and 2.45mb of state, with a checkpoint duration of 280ms to 2 seconds. Each of the 32 subtasks appear to be handling 1,700-50,000 records an hour with a bytes received of 7mb and 170mb
Am I barking up the wrong tree?
-Steve
From: David Anderson <[hidden email]>
Sent: Friday, July 17, 2020 6:54 AM
To: Nelson Steven <[hidden email]>
Cc: [hidden email]
Subject: Re: Backpressure on Window step
Backpressure is typically caused by something like one of these things:
* problems relating to i/o to external services (e.g., enrichment via an API or database lookup, or a sink)
* data skew (e.g., a hot key)
* under-provisioning, or competition for resources
* spikes in traffic
* timer storms
I would start to debug this by looking for signs of significant asymmetry in the metrics (across the various subtasks), or resource exhaustion. Could be related to the network, GC, CPU, disk i/o, etc. Flink's webUI will show you checkpoint size and timing information for each sub-task; you can learn a lot from studying that data.
Relating to session windows -- could you occasionally have an unusually long session, and might that cause problems?
Best,
David
On Tue, Jul 14, 2020 at 10:12 PM Nelson Steven <[hidden email]> wrote:
Hello!
We are experiencing occasional backpressure on a Window function in our pipeline. The window is on a KeyedStream and is triggered by an EventTimeSessionWindows.withGap(Time.seconds(30)). The prior step does a fanout and we use the window to sort things into batches based on the Key for the keyed stream. We aren’t seeing an unreasonable amount of records (500-600/s) on a parallism of 32 (prior step has a parallelism of 4). We are as interested in learning out to debug the issue as we are in fixing the actual problem. Any ideas?
-Steve
Free forum by Nabble | Edit this page |