Posted by Urs Schoenenberger
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/DataSet-partitionByHash-without-materializing-spilling-the-entire-partition-tp15380.html
Hi all,
we have a DataSet pipeline which reads CSV input data and then
essentially does a combinable GroupReduce via first(n).
In our first iteration (readCsvFile -> groupBy(0) -> sortGroup(0) ->
first(n)), we got a job graph like this:
source --[Forward]--> combine --[Hash Partition on 0, Sort]--> reduce
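For reference, here is roughly what that first variant looks like in
the Java DataSet API (a minimal sketch; the input path, the
Tuple2<Long, String> record type, and n=3 are placeholders, not our
actual job):

    import org.apache.flink.api.common.operators.Order;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class FirstNJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();

            // Read (key, value) records from CSV; path and field
            // types are placeholders.
            DataSet<Tuple2<Long, String>> input = env
                    .readCsvFile("file:///path/to/input.csv")
                    .types(Long.class, String.class);

            // Combinable GroupReduce via first(n): keep the first n
            // records of each group.
            DataSet<Tuple2<Long, String>> result = input
                    .groupBy(0)
                    .sortGroup(0, Order.ASCENDING)
                    .first(3);

            // print() triggers execution of the batch job.
            result.print();
        }
    }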
This works, but we found the combine phase to be inefficient because
too few combinable elements fit into a sorter at a time. My idea was
to pre-partition the DataSet so that combinable elements are more
likely to meet in the same sorter
(readCsvFile -> partitionByHash(0) -> groupBy(0) -> sortGroup(0) ->
first(n)).
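The pre-partitioned variant differs only in the extra partitioning
step before the grouping (same placeholder input as in the sketch
above):

    // Hash-partition on the key up front, then group, sort, and
    // take the first n records per group.
    DataSet<Tuple2<Long, String>> result = input
            .partitionByHash(0)
            .groupBy(0)
            .sortGroup(0, Order.ASCENDING)
            .first(3);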
To my surprise, I found that this changed the job graph to
source --[Hash Partition on 0]--> partition(noop) --[Forward]--> combine
--[Hash Partition on 0, Sort]--> reduce
which materializes and spills the entire partition at the
partition(noop) operator!
Is there any way I can partition the data on the way from source to
combine without spilling? That is, can I get a job graph that looks like
source --[Hash Partition on 0]--> combine --[Hash Partition on 0,
Sort]--> reduce
instead?
Thanks,
Urs
--
Urs Schönenberger -
[hidden email]
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082