Flink ID hashing

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink ID hashing

Rex Fenley
Hello,

I'm wondering what sort of algorithm flink uses to map an Integer ID to a subtask when distributing data. Also, what operators from the TableAPI cause data to be redistributed? I know Joins will, what about Aggregates, Sources, Filters?

Thanks!

--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US

Reply | Threaded
Open this post in threaded view
|

Re: Flink ID hashing

Timo Walther
Hi Rex,

for questions like this, I would recommend to checkout the source code
as well.

Search for subclasses of `StreamPartitioner`. For example, for keyBy
Flink uses:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/partitioner/KeyGroupStreamPartitioner.java

which uses

https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java


Flink tries to avoid redistribution. Basically redistribution only
occurs when performing a GROUP BY or when having operators with
different parallelism. For Table API and SQL, you can print the
shuffling steps via `Table.explain()`. They are indicated with an
`Exchange` operation

I hope this helps.

Regards,
Timo


On 16.01.21 19:45, Rex Fenley wrote:

> Hello,
>
> I'm wondering what sort of algorithm flink uses to map an Integer ID to
> a subtask when distributing data. Also, what operators from the TableAPI
> cause data to be redistributed? I know Joins will, what about
> Aggregates, Sources, Filters?
>
> Thanks!
>
> --
>
> Rex Fenley|Software Engineer - Mobile and Backend
>
>
> Remind.com <https://www.remind.com/>| BLOG <http://blog.remind.com/> |
> FOLLOW US <https://twitter.com/remindhq> | LIKE US
> <https://www.facebook.com/remindhq>
>

Reply | Threaded
Open this post in threaded view
|

Re: Flink ID hashing

Rex Fenley

On Mon, Jan 18, 2021 at 1:38 AM Timo Walther <[hidden email]> wrote:
Hi Rex,

for questions like this, I would recommend to checkout the source code
as well.

Search for subclasses of `StreamPartitioner`. For example, for keyBy
Flink uses:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/partitioner/KeyGroupStreamPartitioner.java

which uses

https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java


Flink tries to avoid redistribution. Basically redistribution only
occurs when performing a GROUP BY or when having operators with
different parallelism. For Table API and SQL, you can print the
shuffling steps via `Table.explain()`. They are indicated with an
`Exchange` operation

I hope this helps.

Regards,
Timo


On 16.01.21 19:45, Rex Fenley wrote:
> Hello,
>
> I'm wondering what sort of algorithm flink uses to map an Integer ID to
> a subtask when distributing data. Also, what operators from the TableAPI
> cause data to be redistributed? I know Joins will, what about
> Aggregates, Sources, Filters?
>
> Thanks!
>
> --
>
> Rex Fenley|Software Engineer - Mobile and Backend
>
>
> Remind.com <https://www.remind.com/>| BLOG <http://blog.remind.com/> |
> FOLLOW US <https://twitter.com/remindhq> | LIKE US
> <https://www.facebook.com/remindhq>
>



--

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com |  BLOG  |  FOLLOW US  |  LIKE US