max-fan

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

max-fan

Greg Hogan
When workers spill more than 128 files, I have seen these fully merged into one or more much larger files. Does the following parameter allow more files to be stored without requiring the intermediate merge-sort? I have changed it to 1024 without effect. Also, it appears that the entire set of small files is reprocessed rather than the minimum required to attain the max fan-in (i.e., starting with 150 files, 23 would be merged leaving 128 to be processed concurrently).

taskmanager.runtime.max-fan: The maximal fan-in for external merge joins and fan-out for spilling hash tables. Limits the number of file handles per operator, but may cause intermediate merging/partitioning, if set too small (DEFAULT: 128).

Greg Hogan
Reply | Threaded
Open this post in threaded view
|

Re: max-fan

Stephan Ewen
Hi Greg!

That number should control the merge fan in, yes. Maybe a bug was introduced a while back that prevents this parameter from being properly passed through the system. Have you modified the config value in the cluster, on the client, or are you starting the job via the command line, in which case both are the same? In any case, we'll fix that soon, definitely. Could you open an issue for that?


Concerning the sub-optimal merging: You are right, this could be improved, like you said. Right mow, the attempt is to create uniform files, but your suggestion would be more efficient.

Is this a critical issue for you? Would you be up for making a patch for this? It should be a fairly isolated change.


Greetings,
Stephan


On Thu, Sep 3, 2015 at 3:02 AM, Greg Hogan <[hidden email]> wrote:
When workers spill more than 128 files, I have seen these fully merged into one or more much larger files. Does the following parameter allow more files to be stored without requiring the intermediate merge-sort? I have changed it to 1024 without effect. Also, it appears that the entire set of small files is reprocessed rather than the minimum required to attain the max fan-in (i.e., starting with 150 files, 23 would be merged leaving 128 to be processed concurrently).

taskmanager.runtime.max-fan: The maximal fan-in for external merge joins and fan-out for spilling hash tables. Limits the number of file handles per operator, but may cause intermediate merging/partitioning, if set too small (DEFAULT: 128).

Greg Hogan