We have a job that has been running, but none of the AWS resource metrics for EKS, EC2, MSK and EBS show any bottlenecks. I have multiple 8 cores allocated but only ~2 cores are used. Most of the memory is not consumed. MSK does not show much use. EBS metrics look mostly idle. I assumed I'd be able to see whichever resource is the bottleneck.
Is there a good way to diagnose where the bottleneck is for a Flink job?
Hi Dan,
Could you describe the problem with your job? Is it high latency or low throughput? Please first check the job's throughput and latency. If the throughput matches the rate at which the sources produce data and the latency metric looks good, the job may be running well without any bottleneck. If the throughput or latency looks abnormal, please check the following:
1. Check for back pressure.
2. Check whether the checkpoint duration is long and whether the checkpoint size is as expected.
If you find something abnormal about the job, please share the details in this thread for deeper analysis.
Best,
JING ZHANG
Dan Hill <[hidden email]> wrote on Thu, Jun 17, 2021 at 12:44 PM:
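For the back pressure check, besides the web UI, the JobManager REST API exposes a per-vertex back pressure endpoint. Below is a minimal sketch of polling it, not a definitive tool. Assumptions: the REST endpoint is reachable at `http://localhost:8081`, the response carries a `subtasks` list whose entries have `subtask` and `ratio` fields (field names can differ between Flink versions, so check the REST docs for yours), and the 0.5 threshold is an arbitrary value chosen for illustration. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: JobManager REST address; adjust host and port for your cluster.
REST_URL = "http://localhost:8081"


def backpressured_subtasks(payload, min_ratio=0.5):
    """Return (subtask index, ratio) pairs whose back pressure ratio is >= min_ratio.

    `payload` is the decoded JSON of the back pressure endpoint. The field
    names "subtasks", "subtask" and "ratio" are assumptions about its shape.
    """
    hot = []
    for sub in payload.get("subtasks", []):
        ratio = sub.get("ratio", 0.0)
        if ratio >= min_ratio:
            hot.append((sub.get("subtask"), ratio))
    return hot


def fetch_backpressure(job_id, vertex_id, base_url=REST_URL):
    """Fetch back pressure stats for one job vertex from the REST API."""
    url = f"{base_url}/jobs/{job_id}/vertices/{vertex_id}/backpressure"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hand-written sample payload, for illustration only (not real job output).
    sample = {
        "status": "ok",
        "subtasks": [
            {"subtask": 0, "ratio": 0.92},
            {"subtask": 1, "ratio": 0.03},
        ],
    }
    print(backpressured_subtasks(sample))
```

Against a live cluster you would pass the result of `fetch_backpressure(job_id, vertex_id)` to `backpressured_subtasks`, using the job and vertex IDs shown in the web UI.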
Thanks, JING ZHANG!
I have one subtask for one Kafka source that is getting backpressure. Is there an easy way to split a single Kafka partition across multiple subtasks? Or do I need to split the Kafka partition?
On Wed, Jun 16, 2021 at 10:29 PM JING ZHANG <[hidden email]> wrote:
Hi Dan,
It's better to split the Kafka partition into multiple partitions. If you want to try something without repartitioning the topic, add a rebalance shuffle between the source and the downstream operators, and set a parallelism greater than 1 for the downstream operators. However, this introduces extra CPU cost for serialization/deserialization and extra network cost for shuffling the data. I'm not sure whether the benefits of this approach would offset the additional costs.
Best,
JING ZHANG
Dan Hill <[hidden email]> wrote on Thu, Jun 17, 2021 at 1:49 PM:
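The rebalance suggestion could look roughly like the PyFlink sketch below. It is an illustration under stated assumptions, not the job from this thread: `from_collection` stands in for the single-partition Kafka source, `enrich` is a hypothetical placeholder for the expensive downstream work, and a downstream parallelism of 4 is an arbitrary choice. Running it requires PyFlink to be installed.

```python
# Assumption: any value > 1 that fits the available task slots.
DOWNSTREAM_PARALLELISM = 4


def enrich(record):
    """Hypothetical stand-in for the expensive per-record work after the source."""
    return record.upper()


def build_pipeline(env, records):
    """Build source -> rebalance -> map, with the map running in parallel."""
    # Imported here so the pure-Python helper above is usable without PyFlink.
    from pyflink.common import Types

    # Stand-in for the Kafka source that reads the single partition.
    source = env.from_collection(records, type_info=Types.STRING())
    return (
        source
        .rebalance()  # round-robin records across all downstream subtasks
        .map(enrich, output_type=Types.STRING())
        .set_parallelism(DOWNSTREAM_PARALLELISM)
    )


if __name__ == "__main__":
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    build_pipeline(env, ["a", "b", "c", "d"]).print()
    env.execute("rebalance-sketch")
```

Because the rebalance breaks operator chaining between the source and the map, every record is serialized and sent over the network, which is the extra cost mentioned above. It is worth measuring throughput before and after the change.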
Thanks Jing!
On Wed, Jun 16, 2021 at 11:30 PM JING ZHANG <[hidden email]> wrote: