Hi folks,
I’ve got a single RichOutputFormat that wraps two HadoopOutputFormats, let’s call them A and B, each writing to a different HDFS directory. If a Record matches a certain condition it’s written using A; otherwise it’s written using B. Currently, the parallelism set on the RichOutputFormat seems to propagate to both A and B: if the parallelism on the RichOutputFormat is 10, outputs A and B each create 10 files, even if A receives all the records and B receives none.
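To make the setup concrete, here is a rough, self-contained sketch of the routing described above. It deliberately uses stand-in types (`SimpleOutput`, `RecordingOutput`, `RoutingOutput`) rather than the real Flink or Hadoop classes, and the directory names and `part-N` file naming are only illustrative assumptions. It simulates each parallel subtask opening both delegates, which is why both outputs end up with one file per subtask:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Stand-in for an output format; this is NOT the Flink or Hadoop interface.
interface SimpleOutput {
    void open(int taskNumber, int numTasks);
    void write(String record);
}

// Hypothetical delegate that just remembers which "files" it created.
final class RecordingOutput implements SimpleOutput {
    private final String dir;
    private final List<String> filesCreated; // shared across simulated subtasks

    RecordingOutput(String dir, List<String> filesCreated) {
        this.dir = dir;
        this.filesCreated = filesCreated;
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        // Opening creates the file, whether or not a record ever arrives.
        filesCreated.add(dir + "/part-" + taskNumber);
    }

    @Override
    public void write(String record) {
        // A real delegate would append the record to its file here.
    }
}

// The wrapper: opens both delegates, then routes each record to A or B.
final class RoutingOutput implements SimpleOutput {
    private final Predicate<String> goesToA;
    private final SimpleOutput a;
    private final SimpleOutput b;

    RoutingOutput(Predicate<String> goesToA, SimpleOutput a, SimpleOutput b) {
        this.goesToA = goesToA;
        this.a = a;
        this.b = b;
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        a.open(taskNumber, numTasks);
        b.open(taskNumber, numTasks);
    }

    @Override
    public void write(String record) {
        (goesToA.test(record) ? a : b).write(record);
    }
}

final class RoutingDemo {
    // Simulates `parallelism` subtasks where every record matches A.
    // Returns {files created under A, files created under B}.
    static int[] run(int parallelism) {
        List<String> filesA = new ArrayList<>();
        List<String> filesB = new ArrayList<>();
        for (int task = 0; task < parallelism; task++) {
            SimpleOutput out = new RoutingOutput(
                    record -> true, // every record goes to A
                    new RecordingOutput("/out/a", filesA),
                    new RecordingOutput("/out/b", filesB));
            out.open(task, parallelism);
            out.write("record-" + task);
        }
        return new int[] {filesA.size(), filesB.size()};
    }

    public static void main(String[] args) {
        int[] files = run(10);
        System.out.println("A files=" + files[0] + ", B files=" + files[1]);
    }
}
```

In this simulation B receives no records but still ends up with as many files as A, which mirrors the empty-file behavior in question.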
My app knows the ratio of records it expects to send to output A versus output B, and I would ideally like to pass that down through the RichOutputFormat. That is, if we have a parallelism of 10 and know that 70% of the Records go to A, I would like to give A a parallelism of 7 and B a parallelism of 3.
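The arithmetic for that split could look roughly like the sketch below. `ParallelismSplit` and `split` are hypothetical names used only for illustration; this is plain Java that computes the two numbers and does not call any Flink API. It also assumes each output should keep at least one subtask, which is a design choice rather than a requirement.

```java
final class ParallelismSplit {
    // Splits `total` subtasks between outputs A and B according to the
    // expected share of records going to A. Each side keeps at least one
    // subtask (an assumption of this sketch).
    static int[] split(int total, double ratioA) {
        if (total < 2) {
            throw new IllegalArgumentException("need at least 2 subtasks to split");
        }
        if (!(ratioA >= 0.0 && ratioA <= 1.0)) {
            throw new IllegalArgumentException("ratioA must be in [0, 1]");
        }
        int forA = (int) Math.round(total * ratioA);
        forA = Math.max(1, Math.min(total - 1, forA)); // clamp to [1, total - 1]
        return new int[] {forA, total - forA};
    }

    public static void main(String[] args) {
        int[] s = split(10, 0.7);
        System.out.println("A=" + s[0] + ", B=" + s[1]); // prints "A=7, B=3"
    }
}
```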
I’m curious because the current approach can lead to lots of redundant empty files, and I’d like to minimize that if possible. Is something like this supported?
____________
Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.