(DEPRECATED) Apache Flink User Mailing List archive.

Flink ID hashing


Rex Fenley

Flink ID hashing

Hello,

I'm wondering what sort of algorithm Flink uses to map an Integer ID to a subtask when distributing data. Also, which operators from the Table API cause data to be redistributed? I know Joins will; what about Aggregates, Sources, and Filters?

Thanks!

Rex Fenley | Software Engineer - Mobile and Backend

Remind.com | BLOG | FOLLOW US | LIKE US

Timo Walther

Re: Flink ID hashing

Hi Rex,

for questions like this, I would recommend checking out the source code
as well.

Search for subclasses of `StreamPartitioner`. For example, for keyBy
Flink uses:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/partitioner/KeyGroupStreamPartitioner.java

which uses

https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java

Flink tries to avoid redistribution. Basically, redistribution only
occurs when performing a GROUP BY or when adjacent operators have
different parallelism. For Table API and SQL, you can print the
shuffling steps via `Table.explain()`. They are indicated by an
`Exchange` operation.

I hope this helps.

Regards,
Timo

Rex Fenley

Re: Flink ID hashing

This is great info. It looks like it uses a murmur hash under the hood too [1].

Thanks!

[1] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java#L76
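
For readers who want the mapping spelled out, here is a minimal sketch of how an Integer key appears to end up on a subtask, based on the linked `KeyGroupRangeAssignment` source: the key's `hashCode()` is run through a murmur-style hash, reduced modulo the max parallelism to get a key group, and the key group is then scaled down to a subtask index. The class name `KeyGroupSketch` and its method names are hypothetical, the hash constants are reproduced from memory of Flink's helper, and the max parallelism of 128 is only an example value, so treat the linked source as authoritative.

```java
// Hypothetical illustration only; this is not Flink's actual class.
final class KeyGroupSketch {

    // Final avalanche step of a murmur-style hash (constants assumed from Flink's helper).
    static int bitMix(int in) {
        in ^= in >>> 16;
        in *= 0x85ebca6b;
        in ^= in >>> 13;
        in *= 0xc2b2ae35;
        in ^= in >>> 16;
        return in;
    }

    // Murmur-style hash of a 32-bit value, folded into the non-negative range.
    static int murmurHash(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        code = Integer.rotateLeft(code, 13);
        code = code * 5 + 0xe6546b64;
        code ^= 4;
        code = bitMix(code);
        if (code >= 0) {
            return code;
        } else if (code != Integer.MIN_VALUE) {
            return -code;
        } else {
            return 0;
        }
    }

    // Key -> key group: hash the key's hashCode, then take it modulo max parallelism.
    static int keyGroupFor(Object key, int maxParallelism) {
        return murmurHash(key.hashCode()) % maxParallelism;
    }

    // Key group -> subtask index: contiguous ranges of key groups per subtask.
    static int subtaskForKeyGroup(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Key -> subtask index, combining the two steps above.
    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        return subtaskForKeyGroup(keyGroupFor(key, maxParallelism), maxParallelism, parallelism);
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // example value, an assumption for this sketch
        int parallelism = 4;      // example value
        for (int id = 1; id <= 5; id++) {
            System.out.println("id " + id
                    + " -> key group " + keyGroupFor(id, maxParallelism)
                    + " -> subtask " + subtaskFor(id, maxParallelism, parallelism));
        }
    }
}
```

Because `Integer.hashCode()` is just the integer value, consecutive IDs are scattered by the hash step rather than landing on neighboring subtasks. The two-step design (key to key group, key group to subtask) means that only whole key groups move when the parallelism changes.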
