Skewed CPU utilization analysis


Kai Fu
Hi team,

We're using multiple joins to generate a dynamic view from a Kafka stream. There is no data skew in our input, which can be verified by the number of consumed records across all subtasks of an operator: the counts match closely, as shown in the figures below. Figure-1 shows this metric for the last operator, and the other operators follow a similar trend.

However, we are facing an issue of skewed CPU utilization, as shown in Figure-3. It does not happen every time with these parallelism settings. Is there any guidance for further analysis on this?

image.png
Figure-1. Number of consumed records for the last operator.

image.png
Figure-2. Number of consumed records for all operators.

image.png
Figure-3. Skewed CPU utilization.

--
Best wishes,
- Kai
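
The "no data skew" check described in this post can be put into numbers. The following is only a minimal sketch, assuming the per-subtask consumed-record counts (for example, values read from Flink's numRecordsIn metric) have already been collected into a plain list. The function name skew_summary and the sample counts are hypothetical and are not taken from the thread.

```python
from statistics import mean, pstdev


def skew_summary(records_per_subtask):
    """Return simple imbalance indicators for per-subtask record counts."""
    counts = list(records_per_subtask)
    if not counts:
        raise ValueError("need at least one subtask count")
    avg = mean(counts)
    if avg == 0:
        # No records consumed at all: treat as perfectly balanced.
        return {"max_over_mean": 1.0, "coeff_of_variation": 0.0}
    return {
        # 1.0 means the busiest subtask consumed exactly the average.
        "max_over_mean": max(counts) / avg,
        # Population standard deviation relative to the mean.
        "coeff_of_variation": pstdev(counts) / avg,
    }


# Hypothetical counts for four subtasks of one operator.
balanced = skew_summary([1000, 1010, 990, 1000])
skewed = skew_summary([1000, 1000, 1000, 5000])
print(balanced)
print(skewed)
```

Values of max_over_mean close to 1 and a small coeff_of_variation would support the claim that the input is evenly distributed, which would suggest looking for the CPU skew elsewhere, such as in task placement.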

Re: Skewed CPU utilization analysis

Till Rohrmann
Hi Kai,

What you could check is the deployment of tasks onto the TaskManagers. You could compare where the tasks are deployed when you see the skewed CPU utilization vs. when you do not. Maybe Flink deploys some of the tasks suboptimally in the skewed case. Additionally, you could sample the stack trace of the JVM that has the high CPU load. That way we could figure out what the TaskManager is doing and what keeps the CPU occupied.

Cheers,
Till
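
The deployment comparison suggested above could be sketched as follows. This is an illustration only: it assumes the (subtask, host) pairs for an evenly loaded run and a skewed run have been collected by hand, for example from the subtask listing in the Flink web UI. The names tasks_per_host and compare_placements, the operator names, and the host names are all hypothetical.

```python
from collections import Counter


def tasks_per_host(placement):
    """Count how many subtasks landed on each TaskManager host."""
    return Counter(host for _subtask, host in placement)


def compare_placements(even_run, skewed_run):
    """Map each host whose subtask count differs to (even_count, skewed_count)."""
    even = tasks_per_host(even_run)
    skewed = tasks_per_host(skewed_run)
    return {
        host: (even[host], skewed[host])
        for host in sorted(set(even) | set(skewed))
        if even[host] != skewed[host]
    }


# Hypothetical placements: (subtask name, TaskManager host).
even_run = [
    ("join-1 (1/2)", "tm-a"), ("join-1 (2/2)", "tm-b"),
    ("join-2 (1/2)", "tm-a"), ("join-2 (2/2)", "tm-b"),
]
skewed_run = [
    ("join-1 (1/2)", "tm-a"), ("join-1 (2/2)", "tm-a"),
    ("join-2 (1/2)", "tm-a"), ("join-2 (2/2)", "tm-b"),
]
diff = compare_placements(even_run, skewed_run)
print(diff)
```

If the hosts with higher CPU utilization are also the ones holding more subtasks of the heavy join operators in the skewed run, that would point to task placement rather than the data as the cause.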


Re: Skewed CPU utilization analysis

Kai Fu
Hi Till,

Thank you for the suggestion. The phenomenon is that CPU utilization does not always show this pattern: sometimes it is skewed as above, while at other times all CPUs on the hosts are evenly utilized. For the skewed case, I profiled the job with async-profiler. There is no specific hotspot; almost all of the time is spent on RocksDB access, the same as in the normal case, as shown below.

image.png


--
Best wishes,
- Kai
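
To compare a profile from a skewed run with one from a normal run more precisely than by eye, the share of samples spent in RocksDB frames can be computed from collapsed-stack text (one "frame;frame;... count" line per stack), a format async-profiler can produce. The sketch below assumes the profiles are available in that form. The function name frame_share and the sample stacks are hypothetical and are not taken from the profile in this thread.

```python
def frame_share(collapsed, keyword):
    """Fraction of samples whose stack contains `keyword` (case-insensitive)."""
    total = 0
    matched = 0
    for line in collapsed.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<frames joined by ';'> <sample count>".
        stack, _, count = line.rpartition(" ")
        samples = int(count)
        total += samples
        if keyword.lower() in stack.lower():
            matched += samples
    return matched / total if total else 0.0


# Hypothetical collapsed stacks for a skewed and a normal run.
skewed_profile = """
Thread.run;StreamTask.invoke;RocksDB.get 90
Thread.run;StreamTask.invoke;serialize 10
"""
normal_profile = """
Thread.run;StreamTask.invoke;RocksDB.get 88
Thread.run;StreamTask.invoke;serialize 12
"""
skewed_share = frame_share(skewed_profile, "rocksdb")
normal_share = frame_share(normal_profile, "rocksdb")
print(skewed_share, normal_share)
```

Similar shares in both runs would match the observation that the profile shows no specific hotspot. In that case the difference between hosts is more likely in how much work each one receives than in what that work is.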