How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Taher Koitawala
Hi All,
         Let's say a topic in Kafka has 5 partitions. If I spawn 10 Task Managers with 1 slot each and parallelism is 10, how will records be read from the Kafka topic if I use the FlinkKafkaConsumer?

Will only 5 TMs read and the rest sit idle in that case? Does oversubscribing the number of TMs relative to the number of partitions in the Kafka topic guarantee high throughput?
 
Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163
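[Editor's note: a minimal sketch of the setup described above, assuming the universal FlinkKafkaConsumer connector; the topic name "events", broker address, and group id are hypothetical and not from the thread.]

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaReadJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(10); // 10 TMs with 1 slot each, so 10 parallel subtasks

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical broker
        props.setProperty("group.id", "example-group");       // hypothetical group id

        // The topic has only 5 partitions, so only 5 of the 10 source subtasks
        // are assigned a partition; the other 5 read nothing.
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        stream.print();
        env.execute("Read from 5-partition topic with parallelism 10");
    }
}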
Re: How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Piotr Nowojski
Hi,

Yes, in your case half of the Kafka source tasks wouldn’t read/process any records (you can check that in the web UI). This shouldn’t harm you, as long as your records are redistributed after the source. For example:

source.keyBy(..).process(new MyVeryHeavyOperator()).print()

Should be fine, because `keyBy(…)` will redistribute records. However,

source.map(new MyVeryHeavyOperator()).print()

Will mean that half of the `MyVeryHeavyOperator` instances will be idling as well. To solve that, you might want to consider using:

dataStream.rebalance();

Piotrek
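[Editor's note: to make the difference concrete, here is a hedged, self-contained sketch of the two wirings above; `DistributionExamples`, `wire`, and the stand-in `HeavyMapper` (in place of MyVeryHeavyOperator) are illustrative names, not from the thread.]

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;

public class DistributionExamples {

    // Stand-in for the expensive per-record work.
    public static class HeavyMapper implements MapFunction<String, String> {
        @Override
        public String map(String value) {
            return value.toUpperCase();
        }
    }

    public static void wire(DataStream<String> source) {
        // Chained map(): only the 5 source subtasks that own a Kafka partition
        // do any mapping work; the other 5 map subtasks stay idle.
        source.map(new HeavyMapper()).print();

        // rebalance(): records are round-robined to all 10 downstream subtasks,
        // at the cost of a network shuffle between source and map.
        source.rebalance().map(new HeavyMapper()).print();

        // keyBy() also redistributes (by key hash), so keyed processing
        // downstream likewise uses all 10 subtasks.
        source.keyBy(value -> value).map(new HeavyMapper()).print();
    }
}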


Re: How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Taher Koitawala
Thanks a lot for the explanation. That was exactly what I thought should happen. However, it is always good to have a clear confirmation.


Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163



Re: How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Piotr Nowojski
No problem :)

Piotrek
