Diagnosing bottlenecks in Flink jobs

Dan
We have a job that has been running, but none of the AWS resource metrics for EKS, EC2, MSK, or EBS show any bottlenecks.  I have multiple 8-core instances allocated, but only ~2 cores are used.  Most of the memory is not consumed, MSK does not show much use, and the EBS metrics look mostly idle.  I assumed I'd be able to see whichever resource is the bottleneck.

Is there a good way to diagnose where the bottleneck is for a Flink job?

Re: Diagnosing bottlenecks in Flink jobs

JING ZHANG
Hi Dan,
Could you describe the problem with your job? Is it high latency or low throughput?
Please first check the job's throughput and latency.
If the job's throughput matches the rate at which the sources produce data, and the latency metric looks good, the job may be working fine without any bottleneck.
If you see abnormal throughput or latency, please check the following:
1. Check the backpressure.
2. Check whether the checkpoint duration is long and whether the checkpoint size is as expected.
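Both checks can be scripted against the JobManager's REST API (the `/jobs/<job-id>/vertices/<vertex-id>/backpressure` and `/jobs/<job-id>/checkpoints` endpoints). A minimal sketch of summarizing the two responses; the sample payloads below are hand-written for illustration and won't match your job's actual values:

```python
# Summarize responses from Flink's REST API. In practice you would fetch
# these JSON documents from the JobManager (e.g. with urllib.request);
# the dicts below are hypothetical samples for illustration only.

def backpressured_subtasks(backpressure_response):
    """Indices of subtasks whose backpressure level is not 'ok'."""
    return [
        s["subtask"]
        for s in backpressure_response.get("subtasks", [])
        if s.get("backpressure-level") != "ok"
    ]

def slow_checkpoints(checkpoints_response, max_duration_ms):
    """IDs of completed checkpoints that took longer than max_duration_ms."""
    return [
        c["id"]
        for c in checkpoints_response.get("history", [])
        if c.get("status") == "COMPLETED"
        and c.get("end_to_end_duration", 0) > max_duration_ms
    ]

# Hypothetical sample payloads:
bp = {
    "status": "ok",
    "backpressure-level": "high",
    "subtasks": [
        {"subtask": 0, "backpressure-level": "ok"},
        {"subtask": 1, "backpressure-level": "high"},
        {"subtask": 2, "backpressure-level": "low"},
    ],
}
cp = {
    "history": [
        {"id": 41, "status": "COMPLETED", "end_to_end_duration": 1200},
        {"id": 42, "status": "COMPLETED", "end_to_end_duration": 95000},
    ],
}

print(backpressured_subtasks(bp))   # [1, 2]
print(slow_checkpoints(cp, 60000))  # [42]
```

The Flink web UI shows the same information per vertex, but scripting it makes it easy to watch over time.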

Please share the details in this thread for deeper analysis if you find anything abnormal about the job.

Best,
JING ZHANG

Re: Diagnosing bottlenecks in Flink jobs

Dan
Thanks, JING ZHANG!

I have one subtask for one Kafka source that is experiencing backpressure.  Is there an easy way to split a single Kafka partition across multiple subtasks?  Or do I need to split the Kafka partition?


Re: Diagnosing bottlenecks in Flink jobs

JING ZHANG
Hi Dan,
It would be better to split the Kafka topic into more partitions.
There is a way to try without repartitioning the topic: add a rebalance shuffle between the source and the downstream operators, and set a higher parallelism for the downstream operators. However, this introduces extra CPU cost for serialization/deserialization and extra network cost for shuffling the data, so I'm not sure the benefits would offset the additional costs.
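A sketch of that second approach in the PyFlink DataStream API; the broker, topic, group ID, and map function are hypothetical placeholders, and the connector class names may differ depending on your Flink version:

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaSource,
)

def heavy_transform(value):
    # Placeholder for the expensive per-record work.
    return value

env = StreamExecutionEnvironment.get_execution_environment()

source = (KafkaSource.builder()
          .set_bootstrap_servers("broker:9092")      # hypothetical broker
          .set_topics("events")                      # hypothetical topic
          .set_group_id("my-group")
          .set_starting_offsets(KafkaOffsetsInitializer.earliest())
          .set_value_only_deserializer(SimpleStringSchema())
          .build())

stream = env.from_source(
    source, WatermarkStrategy.no_watermarks(), "kafka-source"
).set_parallelism(1)  # effectively bounded by the single Kafka partition

(stream
 .rebalance()                  # round-robin records to all downstream subtasks
 .map(heavy_transform)
 .set_parallelism(8))          # expensive work now runs on 8 subtasks

env.execute("rebalance-example")
```

The source stays at parallelism 1 (it can't read one partition with more than one subtask), while `rebalance()` spreads its output round-robin so the expensive processing runs at parallelism 8, at the cost of the extra serialization and network shuffle mentioned above.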

Best,
JING ZHANG


Re: Diagnosing bottlenecks in Flink jobs

Dan
Thanks Jing!
