Posted by Urs Schoenenberger
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/DataSet-partitionByHash-without-materializing-spilling-the-entire-partition-tp15380.html
Hi all,
we have a DataSet pipeline which reads CSV input data and then
essentially does a combinable GroupReduce via first(n).
In our first iteration (readCsvFile -> groupBy(0) -> sortGroup(0) ->
first(n)), we got a job graph like this:
source --[Forward]--> combine --[Hash Partition on 0, Sort]--> reduce
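For reference, here is roughly what that first variant looks like in
the Java DataSet API (a minimal sketch; the input path, the
Tuple2<Long, String> record type, and n=3 are placeholders, not our
actual job):

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class FirstNJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();

            // Read (key, value) records from CSV; path and field
            // types are placeholders.
            DataSet<Tuple2<Long, String>> input = env
                    .readCsvFile("file:///path/to/input.csv")
                    .types(Long.class, String.class);

            // Combinable GroupReduce via first(n): keep the first n
            // records of each group.
            DataSet<Tuple2<Long, String>> result = input
                    .groupBy(0)
                    .sortGroup(0, Order.ASCENDING)
                    .first(3);

            // print() triggers execution of the batch job.
            result.print();
        }
    }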
This works, but we found the combine phase to be inefficient because
too few combinable elements fit into a sorter at a time. My idea was
to pre-partition the DataSet so that combinable elements are more
likely to meet in the same sorter
(readCsvFile -> partitionByHash(0) -> groupBy(0) -> sortGroup(0) ->
first(n)).
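The pre-partitioned variant differs only in the extra partitioning
step before the grouping (same placeholder input as in the sketch
above):

    // Hash-partition on the key up front, then group, sort, and
    // take the first n records per group.
    DataSet<Tuple2<Long, String>> result = input
            .partitionByHash(0)
            .groupBy(0)
            .sortGroup(0, Order.ASCENDING)
            .first(3);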
To my surprise, I found that this changed the job graph to
source --[Hash Partition on 0]--> partition(noop) --[Forward]--> combine
--[Hash Partition on 0, Sort]--> reduce
which materializes and spills the entire partition at the
partition(noop) operator!
Is there any way I can partition the data on the way from source to
combine without spilling? That is, can I get a job graph that looks like
source --[Hash Partition on 0]--> combine --[Hash Partition on 0,
Sort]--> reduce
instead?
Thanks,
Urs
--
Urs Schönenberger -
[hidden email]
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082