Subtask skew of multiple left JOINs


Kai Fu
Hi team,

We are facing an issue where resource utilization across subtasks is skewed, even though there is no data skew in our input data.

We have 8 data sources joined together with left joins. The left table contains 1.6B records; one of the right tables has about 1B records, and the others about 0.1B each. The join is written in SQL and produces a query plan with leftInputSpec of HasUniqKey and rightInputSpec of JoinKeyContainsUniqueKey all the way down. The cardinality of the join keys does not differ much: by our statistics, the top-100 key cardinality skew ranges from 1:65K to 1:17K, which is insignificant among 1.6B records.
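For reference, the query is shaped roughly like the following sketch; the table and column names (fact, dim1..dim7, item_id, attr*) are simplified placeholders rather than our real schema:

// Minimal sketch of the cascading left-join shape described above
// (hypothetical names only; the real job has 8 Kafka-backed tables).
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CascadingLeftJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // DDL for the Kafka-backed tables is omitted; assume fact and dim1..dim7 exist.
        tEnv.executeSql(
                "SELECT f.item_id, d1.attr1, d2.attr2 "
                    + "FROM fact AS f "
                    + "LEFT JOIN dim1 AS d1 ON f.item_id = d1.item_id "
                    + "LEFT JOIN dim2 AS d2 ON f.item_id = d2.item_id "
                    // ... five more LEFT JOINs continue the same linear chain ...
                    + "LEFT JOIN dim7 AS d7 ON f.item_id = d7.item_id")
            .print();
    }
}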

The JOIN is CPU intensive: CPU is saturated for roughly the first hour and then drops to ~50%, even though the data in the Kafka sources is far from fully consumed. Disk and network utilization also drop during that time. Since the JOIN is computationally intensive, it should use as many resources as it can, but that does not seem to be the case here: plenty of resources sit idle without any apparent bottleneck.

Our state backend is RocksDB, with checkpointing effectively disabled by enlarging the checkpoint interval toward infinity. The cluster consists of 32 x i3.2xlarge instances on AWS, each with 8 vCPUs, 61 GB of memory, and a 1.7 TB NVMe SSD.
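Concretely, the "infinite" interval is just a very large value, along the lines of this sketch (not our exact code):

// Sketch only: checkpointing is effectively disabled by configuring an
// extremely large checkpoint interval.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointIntervalSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Interval is in milliseconds; any sufficiently large value means
        // no checkpoint is ever triggered during the lifetime of the job.
        env.enableCheckpointing(Long.MAX_VALUE);
    }
}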

The following are some related metrics. My questions are:
1. Does the one busy subtask cause back pressure for all upstream subtasks, or only some of them?
2. If it only causes back pressure for one or some of them, why can't the others make progress?
3. Are there any hints on locating the bottleneck and on optimizing this job overall?

[Figure-1. Subtask resource utilization skew]

[Figure-2. No data skew for incoming records]

[Figure-3. CPU utilization pattern of task manager]

[Figure-4. Kafka source lag max and task manager CPU utilization]

--
Best wishes,
- Kai

Re: Subtask skew of multiple left JOINs

JING ZHANG
Hi Kai,
> 1. Does the one busy subtask cause back pressure for all upstream subtasks, or only some of them?
One busy subtask only causes back pressure for the upstream subtasks that need to send data to it. In your case, the upstream subtasks that hash-shuffle data to subtasks 248 and 251 would be back pressured. You can double-check this by opening the back pressure tab of the adjacent upstream operator in the Web UI.

> 2. If it only causes back pressure for one or some of them, why can't the others make progress?
The subtasks of the adjacent upstream node that do not need to send data to the hot downstream subtask can keep working. However, back pressure propagates through the whole job pipeline, all the way up to the sources. Things are worse in your case because there are multiple hash joins, so each shuffle spreads the back pressure to more upstream subtasks.

> 3. Are there any hints on locating the bottleneck and on optimizing this job overall?
There is a very useful debugging tool in 1.13.0 that can help you find out why subtasks 248 and 251 are so busy: CPU flame graphs have been supported in the Web UI since 1.13.0, see [1][2].
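Note that, as far as I remember, the feature is disabled by default; you need to set rest.flamegraph.enabled: true in the Flink configuration before the FlameGraph tab shows up in the Web UI.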


Best regards,
JING ZHANG


Re: Subtask skew of multiple left JOINs

Kai Fu
Hi Jing,

Thank you for the prompt reply. 
1. Indeed, the corresponding upstream subtasks show that they are back pressured by that specific subtask;
2. The explanation makes sense that a partition can be shared by multiple subtasks, and that the effect is amplified by the multiple hash join phases;
3. I profiled with the flame graph feature in the Web UI, and did another round of profiling with async-profiler for that specific thread (rough command noted below). The hot spot is RocksDB's seek: the HasUniqKey input spec introduces suboptimal matching, which is costly per thread.
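(In case it helps to reproduce: a per-thread CPU profile can be taken with something like ./profiler.sh -d 60 -e cpu -t -f profile.html <taskmanager-pid>; the exact flags depend on the async-profiler version.)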

Please see the attached profiling file and advise.



--
Best wishes,
- Kai

[Attachment: 26001.html (532K)]

Re: Subtask skew of multiple left JOINs

JING ZHANG
Hi Kai,
I agree with your analysis that the hot spot is RocksDB's seek because the left side is HasUniqKey; it happens when the join operator processes input data from the right side.
My suspicion about why only subtasks 248 and 251 are hot is that the right side is skewed. Since the right side is much smaller than the left side, such skew would not show up as data skew for the incoming records in the Web UI.

You could double-check whether this is the root cause in one of the following ways:
1. Profile one of the other subtasks and compare the result with the profile of the hot subtask.
2. If possible, block the left-side stream for a while and watch the incoming records in the Web UI.

I hope it helps.

Best regards,
JING ZHANG


Re: Subtask skew of multiple left JOINs

Kai Fu
Hi JING,

Thank you for the suggestions. We optimized the job by restructuring the joins into a more bushy-join-like shape to shorten the pipeline, and by grouping the joins whose conditions match perfectly as early as possible (schematically, something like the sketch below). Now the CPU is saturated throughout the cold-start phase, so the problem is solved for the moment. I still think the subtask skew itself deserves a deeper dive, given that there is no apparent upstream data skew. Thank you for the help.
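The restructuring looks roughly like the sketch below. Names are placeholders again, and whether the pre-joined groups use inner or left joins depends on the null-handling semantics required:

// Sketch of the restructuring (hypothetical names): instead of one long
// linear chain  fact LEFT JOIN dim1 LEFT JOIN dim2 ... LEFT JOIN dim7,
// dimension tables whose join conditions match exactly are pre-joined into
// intermediate views, and the fact table then left-joins the (fewer)
// grouped results, which shortens the join pipeline.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BushyJoinSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // DDL for fact and dim1..dim4 omitted.
        tEnv.executeSql(
                "CREATE TEMPORARY VIEW dims_a AS "
                    + "SELECT d1.item_id, d1.attr1, d2.attr2 "
                    + "FROM dim1 AS d1 JOIN dim2 AS d2 ON d1.item_id = d2.item_id");

        tEnv.executeSql(
                "CREATE TEMPORARY VIEW dims_b AS "
                    + "SELECT d3.item_id, d3.attr3, d4.attr4 "
                    + "FROM dim3 AS d3 JOIN dim4 AS d4 ON d3.item_id = d4.item_id");

        tEnv.executeSql(
                "SELECT f.item_id, a.attr1, a.attr2, b.attr3, b.attr4 "
                    + "FROM fact AS f "
                    + "LEFT JOIN dims_a AS a ON f.item_id = a.item_id "
                    + "LEFT JOIN dims_b AS b ON f.item_id = b.item_id")
            .print();
    }
}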



--
Best wishes,
- Kai