Performance Flink streaming Kafka consumer sink to S3

Performance Flink streaming Kafka consumer sink to S3

Vijayendra Yadav
Hi Team,

I am trying to increase the throughput of my Flink streaming job, which reads from a Kafka source and sinks to S3. It currently runs fine for small event records, but records with large payloads are processed extremely slowly, at a rate of about 2 TPS.

Could you provide some best practices for tuning?
Also, can we increase parallel processing beyond the number of Kafka partitions we have, without causing any overhead?

Regards,
Vijay

Re: Performance Flink streaming Kafka consumer sink to S3

rmetzger0
Hi,

> Also, can we increase parallel processing beyond the number of Kafka partitions we have, without causing any overhead?

Yes, the idle Kafka source subtasks produce a tiny bit of overhead, but the potential benefit of running the downstream operators at a higher parallelism might be much bigger.

How large is a large payload in your case?

Best practices:
- Try to understand what's causing the performance slowdown: Kafka or S3?
- Do a test where you read from Kafka and write into a discarding sink (see the sketch below).
- Likewise, use a data-generator source and write into S3.
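A minimal sketch of the Kafka-only test, assuming the DataStream API and the flink-connector-kafka consumer (topic, brokers, and group id are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaReadOnlyTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "throughput-test");      // placeholder

        // Read from Kafka and immediately discard every record. If this
        // pipeline is fast, the bottleneck is most likely on the S3 side.
        env.addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
           .addSink(new DiscardingSink<>());

        env.execute("kafka-read-only-test");
    }
}

For the reverse test, swap the Kafka source for a generator source that emits synthetic records of a realistic size, and keep your real S3 sink.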

Do the math on your job: what are its theoretical limits? (For example, at 1 MB per record, a target of 1,000 records/s already means roughly 1 GB/s of data movement.) See https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines

Hope this helps,
Robert 


Re: Performance Flink streaming Kafka consumer sink to S3

Vijayendra Yadav
Hi Robert,

Thanks for the information. The payloads so far are 400 KB per record.
To achieve high parallelism at the downstream operators, do I rebalance the Kafka stream? Could you give me an example, please?
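Something like the following is what I had in mind. It is a rough, untested sketch with the DataStream API; the parallelism values, topic, brokers, and the map step are only placeholders:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToS3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", "my-group");             // placeholder

        env.addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
           .setParallelism(8)        // = number of Kafka partitions
           .rebalance()              // round-robin records to all downstream subtasks
           .map(String::trim)        // placeholder for the real transformation
           .setParallelism(32)       // higher than the Kafka partition count
           .addSink(StreamingFileSink
               .forRowFormat(new Path("s3://my-bucket/output"),
                             new SimpleStringEncoder<String>("UTF-8"))
               .build())
           .setParallelism(32);

        env.execute("kafka-to-s3");
    }
}

Is rebalancing like this the right way to let the map and the sink run wider than the source?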

Regards,
Vijay


Re: Performance Flink streaming Kafka consumer sink to S3

Vijayendra Yadav
Hi,

Do you think there could be any issue with Flink's performance at payload sizes of 400 KB up to 1 MB per record? My equivalent Spark Streaming job seems to be doing better. Are there any recommended configurations, or ways of increasing parallelism, to improve Flink streaming throughput with the Flink Kafka connector?
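For reference, this is roughly how I create the consumer today, plus the fetch settings I was planning to try for the larger records. It is only a sketch; the property names are standard Kafka consumer settings passed through to the Flink connector, and the values are guesses:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// ...

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092");     // placeholder
props.setProperty("group.id", "my-group");                 // placeholder
// Fetch tuning for 400 KB to 1 MB records (values are guesses):
props.setProperty("max.partition.fetch.bytes", "8388608"); // 8 MB, default 1 MB
props.setProperty("fetch.max.bytes", "67108864");          // 64 MB, default 50 MB
props.setProperty("receive.buffer.bytes", "2097152");      // 2 MB socket receive buffer

FlinkKafkaConsumer<String> consumer =
    new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props);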

Regards,
Vijay

