Hello,

I'm wondering what sort of algorithm Flink uses to map an Integer ID to a subtask when distributing data. Also, which operators from the Table API cause data to be redistributed? I know Joins will; what about Aggregates, Sources, Filters?

Thanks!

--
Rex Fenley | Software Engineer - Mobile and Backend
Remind.com
Hi Rex,
For questions like this, I would recommend checking out the source code as well. Search for subclasses of `StreamPartitioner`. For example, for keyBy Flink uses:

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/partitioner/KeyGroupStreamPartitioner.java

which uses

https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/state/KeyGroupRangeAssignment.java

Flink tries to avoid redistribution. Basically, redistribution only occurs when performing a GROUP BY or when operators have different parallelism.

For Table API and SQL, you can print the shuffling steps via `Table.explain()`. They are indicated by an `Exchange` operation.

I hope this helps.

Regards,
Timo

On 16.01.21 19:45, Rex Fenley wrote:
> Hello,
>
> I'm wondering what sort of algorithm flink uses to map an Integer ID to
> a subtask when distributing data. Also, what operators from the TableAPI
> cause data to be redistributed? I know Joins will, what about
> Aggregates, Sources, Filters?
>
> Thanks!
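To make the keyBy path above more concrete, here is a rough, standalone sketch of the idea behind `KeyGroupRangeAssignment`: hash the key's `hashCode()`, take it modulo the maximum parallelism to get a key group, then scale the key group down to a subtask index. The class name `SubtaskAssignmentSketch` and the method `subtaskFor` are made up for illustration, and the hash mixing is written from memory in the style of MurmurHash3, so treat the linked source files as authoritative rather than this sketch.

```java
// Illustrative sketch only; NOT a copy of Flink's code. Names are hypothetical.
final class SubtaskAssignmentSketch {

    // MurmurHash3-style scrambling of a 32-bit value (assumed to resemble
    // what Flink applies to key.hashCode(); verify against the real source).
    static int murmurHash(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        code = Integer.rotateLeft(code, 13);
        code = code * 5 + 0xe6546b64;
        code ^= 4;
        // final avalanche step
        code ^= code >>> 16;
        code *= 0x85ebca6b;
        code ^= code >>> 13;
        code *= 0xc2b2ae35;
        code ^= code >>> 16;
        // force a non-negative result so the modulo below stays in range
        if (code >= 0) {
            return code;
        }
        return code != Integer.MIN_VALUE ? -code : 0;
    }

    // Key -> key group in [0, maxParallelism)
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return murmurHash(key.hashCode()) % maxParallelism;
    }

    // Key group -> subtask index in [0, parallelism)
    static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Convenience wrapper (hypothetical): key -> subtask index
    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        int keyGroup = assignToKeyGroup(key, maxParallelism);
        return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, keyGroup);
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed value for the example
        int parallelism = 4;      // assumed value for the example
        for (int id = 0; id < 8; id++) {
            System.out.println("id " + id + " -> subtask " + subtaskFor(id, maxParallelism, parallelism));
        }
    }
}
```

The two-step design means contiguous ranges of key groups land on the same subtask, so the key-to-key-group mapping does not change when the parallelism changes; only the key-group-to-subtask mapping does.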
This is great info. Looks like it uses murmur hash below the surface too [1].

Thanks!

--
Rex Fenley | Software Engineer - Mobile and Backend
Remind.com

On Mon, Jan 18, 2021 at 1:38 AM Timo Walther <[hidden email]> wrote:
> Hi Rex,