(DEPRECATED) Apache Flink User Mailing List archive.

Flink Kafka Connector Source Parallelism

Classic

List

Threaded

3 messages Options

Chen, Mason

Flink Kafka Connector Source Parallelism

Hi all,

I’m currently trying to understand Flink’s Kafka Connector and how parallelism affects it. So, I am running the flink playground click count job and the parallelism is set to 2 by default.

However, I don’t see the 2^nd subtask of the Kafka Connector sending any records: https://imgur.com/cA5ucSg. Do I need to rebalance after reading from kafka? 

```
clicks = clicks
   .keyBy(ClickEvent::getPage)
   .map(new BackpressureMap())
   .name("Backpressure");
```

`clicks` is the kafka click stream. From my reading in the operator docs, it seems counterintuitive to do a `rebalance()` when I am already doing a `keyBy()`.

So, my questions:

1. How do I make use of the 2^nd subtask?

2. Does the number of partitions have some sort of correspondence with the parallelism of the source operator? If so, is there a general statement to be made about parallelism across all source operators?

Thanks,

Mason

Chen, Mason

Re: Flink Kafka Connector Source Parallelism

I think I may have just answered my own question. There’s only one Kafka partition, so the maximum parallelism is one and it doesn’t really make sense to make another kafka consumer under the same group id. What threw me off is that there’s a 2^nd subtask for the kafka source created even though it’s not actually doing anything. So, it seems a general statement can be made that (# kafka partitions) >= (# parallelism of flink kafka source)…well I guess you could have more parallelism than kafka partitions, but the extra subtasks will not doing anything.

From: "Chen, Mason" <[hidden email]>
Date: Wednesday, May 27, 2020 at 11:09 PM
To: "[hidden email]" <[hidden email]>
Subject: Flink Kafka Connector Source Parallelism

Hi all,

I’m currently trying to understand Flink’s Kafka Connector and how parallelism affects it. So, I am running the flink playground click count job and the parallelism is set to 2 by default.

However, I don’t see the 2^nd subtask of the Kafka Connector sending any records: https://imgur.com/cA5ucSg. Do I need to rebalance after reading from kafka? 

```
clicks = clicks
   .keyBy(ClickEvent::getPage)
   .map(new BackpressureMap())
   .name("Backpressure");
```

Thanks,

Mason

rmetzger0

Re: Flink Kafka Connector Source Parallelism

Hi Mason,

your understanding is correct.

On Thu, May 28, 2020 at 8:23 AM Chen, Mason <[hidden email]> wrote:

I think I may have just answered my own question. There’s only one Kafka partition, so the maximum parallelism is one and it doesn’t really make sense to make another kafka consumer under the same group id. What threw me off is that there’s a 2^nd subtask for the kafka source created even though it’s not actually doing anything. So, it seems a general statement can be made that (# kafka partitions) >= (# parallelism of flink kafka source)…well I guess you could have more parallelism than kafka partitions, but the extra subtasks will not doing anything.

From: "Chen, Mason" <[hidden email]>
Date: Wednesday, May 27, 2020 at 11:09 PM
To: "[hidden email]" <[hidden email]>
Subject: Flink Kafka Connector Source Parallelism

Hi all,

I’m currently trying to understand Flink’s Kafka Connector and how parallelism affects it. So, I am running the flink playground click count job and the parallelism is set to 2 by default.
However, I don’t see the 2^nd subtask of the Kafka Connector sending any records: https://imgur.com/cA5ucSg. Do I need to rebalance after reading from kafka? 

```
clicks = clicks
   .keyBy(ClickEvent::getPage)
   .map(new BackpressureMap())
   .name("Backpressure");
```
`clicks` is the kafka click stream. From my reading in the operator docs, it seems counterintuitive to do a `rebalance()` when I am already doing a `keyBy()`.

So, my questions:

1. How do I make use of the 2^nd subtask?

2. Does the number of partitions have some sort of correspondence with the parallelism of the source operator? If so, is there a general statement to be made about parallelism across all source operators?

Thanks,

Mason