Parallelism and Partitioning

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Parallelism and Partitioning

Mohit Anchlia
What is the granularity of parallelism in flink? For eg: if I am reading from Kafka -> map -> Sink and I say parallel 2 does it mean it creates 2 consumer threads and allocates it on 2 separate task managers?

Also, it would be good to understand the difference between parallelism and partitioning that also could be distributed across task managers.
Reply | Threaded
Open this post in threaded view
|

Re: Parallelism and Partitioning

Mohit Anchlia
Any information on this would be helpful.

On Thu, Feb 2, 2017 at 5:09 PM, Mohit Anchlia <[hidden email]> wrote:
What is the granularity of parallelism in flink? For eg: if I am reading from Kafka -> map -> Sink and I say parallel 2 does it mean it creates 2 consumer threads and allocates it on 2 separate task managers?

Also, it would be good to understand the difference between parallelism and partitioning that also could be distributed across task managers.

Reply | Threaded
Open this post in threaded view
|

Re: Parallelism and Partitioning

Ufuk Celebi
In reply to this post by Mohit Anchlia
On Fri, Feb 3, 2017 at 2:09 AM, Mohit Anchlia <[hidden email]> wrote:
> What is the granularity of parallelism in flink? For eg: if I am reading
> from Kafka -> map -> Sink and I say parallel 2 does it mean it creates 2
> consumer threads and allocates it on 2 separate task managers?

Yes, this will instantiate two instances of the Kafka source, the map
operator, and the sink. These parallel sub pipelines will be scheduled
to separate slots (that might happen to on the same TM). See here for
more details: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/job_scheduling.html

Partitioning of state happens in key groups, which define a range of
keys. A single subtask is usually responsible for more than a single
key group. The key groups are the units of rescaling your program.
This is configurable via the setMaxParallelism() method on the
environment.