Hi Team,
I am trying to increase the throughput of my Flink streaming job, which reads from a Kafka source and sinks to S3. It currently runs fine for small event records, but records with large payloads are processed extremely slowly, at a rate of about 2 TPS. Could you share some best practices for tuning? Also, can we increase parallel processing beyond the number of Kafka partitions that we have, without causing any overhead?

Regards,
Vijay
Hi,

> Also, can we increase parallel processing, beyond the number of kafka partitions that we have, without causing any overhead?

Yes. The Kafka sources add a tiny bit of overhead, but the potential benefit of running downstream operators at a higher parallelism can be much bigger.

How large is a large payload in your case?

Best practices: try to understand what is causing the slowdown: Kafka or S3? You can run a test where you read from Kafka and write into a discarding sink. Likewise, use a data generator source and write into S3. Then do the math on your job to find its theoretical limits: https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines

Hope this helps,
Robert

On Thu, Aug 13, 2020 at 11:25 PM Vijayendra Yadav <[hidden email]> wrote:
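[Editor's note: a minimal sketch of the discarding-sink test Robert describes, assuming the universal Kafka connector (FlinkKafkaConsumer) from Flink 1.11 and placeholder broker/topic/group names. If this job is fast, the bottleneck is the S3 sink rather than the Kafka read path.]

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaReadOnlyThroughputTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder broker
        props.setProperty("group.id", "throughput-test");      // placeholder group

        // Read from the real topic, but discard every record.
        // This isolates the Kafka read path from the S3 write path.
        env.addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
           .addSink(new DiscardingSink<>());

        env.execute("kafka-read-only-throughput-test");
    }
}

The mirror-image test (a data generator source writing into the real S3 sink) isolates the write path the same way.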
Hi Robert,

Thanks for the information. Payloads so far are 400 KB per record. To achieve higher parallelism at the downstream operators, do I rebalance the Kafka stream? Could you give me an example, please?

Regards,
Vijay

On Fri, Aug 14, 2020 at 12:50 PM Robert Metzger <[hidden email]> wrote:
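[Editor's note: a minimal sketch of what rebalancing the stream could look like, assuming a topic with 8 partitions; the uppercase map and discarding sink are stand-ins for the real transformation and S3 sink.]

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class RebalanceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder broker
        props.setProperty("group.id", "rebalance-example");    // placeholder group

        // Source parallelism is effectively capped by the partition count;
        // extra source subtasks would just sit idle.
        DataStream<String> fromKafka = env
            .addSource(new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props))
            .setParallelism(8); // e.g. the topic has 8 partitions

        fromKafka
            .rebalance() // round-robin records across all downstream subtasks
            .map((MapFunction<String, String>) String::toUpperCase) // stand-in for the expensive per-record work
            .setParallelism(32) // downstream runs wider than the source
            .addSink(new DiscardingSink<>()) // stand-in for the S3 sink
            .setParallelism(32);

        env.execute("rebalance-example");
    }
}

Note that rebalance() introduces a network shuffle between the source and the downstream operators, so each record is serialized and transferred once more; with 400 KB records that extra cost is worth measuring before committing to it.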
Hi,

Do you think there can be any issue with Flink's performance at record sizes of 400 KB up to 1 MB? My Spark streaming job seems to be doing better. Are there any recommended configurations, or ways of increasing parallelism, to improve Flink streaming with the Flink Kafka connector?

Regards,
Vijay

On Fri, Aug 14, 2020 at 2:04 PM Vijayendra Yadav <[hidden email]> wrote: