Diagnosing bottlenecks in Flink jobs

Dan
We have a job that has been running, but none of the AWS resource metrics for EKS, EC2, MSK, or EBS show any bottlenecks.  I have multiple 8-core instances allocated, but only ~2 cores are used.  Most of the memory is not consumed, MSK does not show much use, and the EBS metrics look mostly idle.  I assumed I'd be able to see whichever resource is the bottleneck.

Is there a good way to diagnose where the bottleneck is for a Flink job?

Re: Diagnosing bottlenecks in Flink jobs

JING ZHANG
Hi Dan,
Could you describe the problem with your job? Is it high latency or low throughput?
Please first check the job's throughput and latency.
If the job's throughput matches the rate at which the sources produce data, and the latency metric looks good, the job may be working fine without any bottleneck.
If you see abnormal throughput or latency, please check the following:
1. Check the backpressure.
2. Check whether the checkpoint duration is long and whether the checkpoint size is as expected.
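Both checks can be scripted against the JobManager's REST API (the `/jobs/<job-id>/vertices/<vertex-id>/backpressure` and `/jobs/<job-id>/checkpoints` endpoints). A minimal sketch of summarizing the two responses; the sample payloads below are hand-written for illustration and won't match your job's actual values:

```python
# Summarize responses from Flink's REST API. In practice you would fetch
# these JSON documents from the JobManager (e.g. with urllib.request);
# the dicts below are hypothetical samples for illustration only.

def backpressured_subtasks(backpressure_response):
    """Indices of subtasks whose backpressure level is not 'ok'."""
    return [
        s["subtask"]
        for s in backpressure_response.get("subtasks", [])
        if s.get("backpressure-level") != "ok"
    ]

def slow_checkpoints(checkpoints_response, max_duration_ms):
    """IDs of completed checkpoints that took longer than max_duration_ms."""
    return [
        c["id"]
        for c in checkpoints_response.get("history", [])
        if c.get("status") == "COMPLETED"
        and c.get("end_to_end_duration", 0) > max_duration_ms
    ]

# Hypothetical sample payloads:
bp = {
    "status": "ok",
    "backpressure-level": "high",
    "subtasks": [
        {"subtask": 0, "backpressure-level": "ok"},
        {"subtask": 1, "backpressure-level": "high"},
        {"subtask": 2, "backpressure-level": "low"},
    ],
}
cp = {
    "history": [
        {"id": 41, "status": "COMPLETED", "end_to_end_duration": 1200},
        {"id": 42, "status": "COMPLETED", "end_to_end_duration": 95000},
    ],
}

print(backpressured_subtasks(bp))   # [1, 2]
print(slow_checkpoints(cp, 60000))  # [42]
```

The Flink web UI shows the same information per vertex, but scripting it makes it easy to watch over time.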

Please share the details in this thread for deeper analysis if you find anything abnormal about the job.

Best,
JING ZHANG

Re: Diagnosing bottlenecks in Flink jobs

Dan
Thanks, JING ZHANG!

I have one subtask for one Kafka source that is experiencing backpressure.  Is there an easy way to split a single Kafka partition across multiple subtasks?  Or do I need to split the Kafka partition?


Re: Diagnosing bottlenecks in Flink jobs

JING ZHANG
Hi Dan,
It would be better to split the Kafka topic into more partitions.
There is a way to try without repartitioning the topic: add a rebalance shuffle between the source and the downstream operators, and set a higher parallelism for the downstream operators. However, this introduces extra CPU cost for serialization/deserialization and extra network cost for shuffling the data, so I'm not sure the benefits would offset the additional costs.
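A sketch of that second approach in the PyFlink DataStream API; the broker, topic, group ID, and map function are hypothetical placeholders, and the connector class names may differ depending on your Flink version:

```python
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaOffsetsInitializer,
    KafkaSource,
)

def heavy_transform(value):
    # Placeholder for the expensive per-record work.
    return value

env = StreamExecutionEnvironment.get_execution_environment()

source = (KafkaSource.builder()
          .set_bootstrap_servers("broker:9092")      # hypothetical broker
          .set_topics("events")                      # hypothetical topic
          .set_group_id("my-group")
          .set_starting_offsets(KafkaOffsetsInitializer.earliest())
          .set_value_only_deserializer(SimpleStringSchema())
          .build())

stream = env.from_source(
    source, WatermarkStrategy.no_watermarks(), "kafka-source"
).set_parallelism(1)  # effectively bounded by the single Kafka partition

(stream
 .rebalance()                  # round-robin records to all downstream subtasks
 .map(heavy_transform)
 .set_parallelism(8))          # expensive work now runs on 8 subtasks

env.execute("rebalance-example")
```

The source stays at parallelism 1 (it can't read one partition with more than one subtask), while `rebalance()` spreads its output round-robin so the expensive processing runs at parallelism 8, at the cost of the extra serialization and network shuffle mentioned above.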

Best,
JING ZHANG


Re: Diagnosing bottlenecks in Flink jobs

Dan
Thanks Jing!
