group by optimizations with sorted input

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

group by optimizations with sorted input

Richard Moorhead
In batch mode, if input is sorted prior to a group by operation; does flink forward the aggregate data early? Is there a way to prevent grouping operations from buffering all data in a GBK operation in batch mode?
Reply | Threaded
Open this post in threaded view
|

Re: group by optimizations with sorted input

rmetzger0
I assume you are using the DataSet API.

The combine() method will be executed on the sender side, reducing the amount of data to spill on disk. This only works if your data allows such early aggregations.
This is similar to a combiner in Hadoop: https://www.quora.com/What-is-a-Combiner-in-Hadoop


On Thu, Feb 13, 2020 at 8:01 PM Richard Moorhead <[hidden email]> wrote:
In batch mode, if input is sorted prior to a group by operation; does flink forward the aggregate data early? Is there a way to prevent grouping operations from buffering all data in a GBK operation in batch mode?