When workers spill more than 128 files, I have seen these fully merged into one or more much larger files. Does the following parameter allow more files to be kept without requiring the intermediate merge-sort? I have changed it to 1024 without effect. Also, it appears that the entire set of small files is reprocessed rather than only the minimum needed to bring the count down to the max fan-in (e.g., starting with 150 files, merging just 23 of them would leave 128 files to be processed concurrently).
taskmanager.runtime.max-fan: The maximal fan-in for external merge joins and fan-out for spilling hash tables. Limits the number of file handles per operator, but may cause intermediate merging/partitioning, if set too small (DEFAULT: 128).
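
For reference, this is roughly how I am overriding the value for local testing; the configuration key is taken from the description above, and I am assuming the createLocalEnvironment(Configuration) overload and setting the same key in flink-conf.yaml on a cluster are equivalent ways to apply it. The job body here is only a placeholder, not the actual join that spills.

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.apache.flink.configuration.Configuration;

public class MaxFanSketch {
    public static void main(String[] args) throws Exception {
        // Raise the fan-in/fan-out limit from its default of 128.
        Configuration conf = new Configuration();
        conf.setInteger("taskmanager.runtime.max-fan", 1024);

        // Local environment picking up the custom configuration; on a
        // cluster the same key would go into flink-conf.yaml instead.
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(conf);

        // Placeholder job; the real job joins inputs large enough to
        // spill more than 128 files per operator.
        env.fromElements(1, 2, 3)
           .output(new DiscardingOutputFormat<Integer>());

        env.execute("max-fan sketch");
    }
}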
Greg Hogan