How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Taher Koitawala
Hi All,
         Let's say a topic in Kafka has 5 partitions. If I spawn 10 Task Managers with 1 slot each and parallelism is 10, how will records be read from the Kafka topic if I use the FlinkKafkaConsumer?

Will only 5 TMs read and the rest sit idle in that case? Does oversubscribing the number of TMs relative to the number of partitions in the Kafka topic guarantee high throughput?
 
Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163
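[Editor's note: a minimal sketch of the setup described above, assuming the universal FlinkKafkaConsumer connector; the topic name "events", broker address, and group id are hypothetical and not from the thread.]

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaReadJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(10); // 10 TMs with 1 slot each, so 10 parallel subtasks

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // hypothetical broker
        props.setProperty("group.id", "example-group");       // hypothetical group id

        // The topic has only 5 partitions, so only 5 of the 10 source subtasks
        // are assigned a partition; the other 5 read nothing.
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        stream.print();
        env.execute("Read from 5-partition topic with parallelism 10");
    }
}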
Re: How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Piotr Nowojski
Hi,

Yes, in your case half of the Kafka source tasks wouldn’t read/process any records (you can check that in the web UI). This shouldn’t harm you, as long as your records are redistributed after the source. For example:

source.keyBy(..).process(new MyVeryHeavyOperator()).print()

Should be fine, because `keyBy(…)` will redistribute records. However,

source.map(new MyVeryHeavyOperator()).print()

Will mean that half of the `MyVeryHeavyOperator` instances will be idling as well. To solve that, you might want to consider using:

dataStream.rebalance();

Piotrek
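[Editor's note: to make the difference concrete, here is a hedged, self-contained sketch of the two wirings above; `DistributionExamples`, `wire`, and the stand-in `HeavyMapper` (in place of MyVeryHeavyOperator) are illustrative names, not from the thread.]

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;

public class DistributionExamples {

    // Stand-in for the expensive per-record work.
    public static class HeavyMapper implements MapFunction<String, String> {
        @Override
        public String map(String value) {
            return value.toUpperCase();
        }
    }

    public static void wire(DataStream<String> source) {
        // Chained map(): only the 5 source subtasks that own a Kafka partition
        // do any mapping work; the other 5 map subtasks stay idle.
        source.map(new HeavyMapper()).print();

        // rebalance(): records are round-robined to all 10 downstream subtasks,
        // at the cost of a network shuffle between source and map.
        source.rebalance().map(new HeavyMapper()).print();

        // keyBy() also redistributes (by key hash), so keyed processing
        // downstream likewise uses all 10 subtasks.
        source.keyBy(value -> value).map(new HeavyMapper()).print();
    }
}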


Re: How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Taher Koitawala
Thanks a lot for the explanation. That was exactly what I thought should happen. However, it is always good to have a clear confirmation.


Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163



Re: How does Flink read data from Kafka when the number of TMs is greater than the number of topic partitions

Piotr Nowojski
No problem :)

Piotrek
