(DEPRECATED) Apache Flink User Mailing List archive.

Distribute Parallelism/Tasks within RichOutputFormat?

Posted by Hailu, Andreas on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Distribute-Parallelism-Tasks-within-RichOutputFormat-tp40270.html

Hi folks,

I’ve got a single RichOutputFormat that is composed of two HadoopOutputFormats, let’s call them A and B, each writing to a different HDFS directory. If a Record matches a certain condition it’s written using A; otherwise it’s written using B. Currently, the parallelism set on the RichOutputFormat seems to propagate to both A and B. For example, if the parallelism set on the RichOutputFormat is 10, outputs A and B each create 10 files, even if A receives all the records and B receives none.

My app knows the ratio of records it expects to send to output A versus output B, and I would ideally like to pass that down through the RichOutputFormat. For example, with a parallelism of 10 and 70% of the Records going to A, I would like to give A a parallelism of 7 and B a parallelism of 3.
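
To make the arithmetic concrete, here is a minimal sketch in plain Java of the split I have in mind. The class name ParallelismSplit and the method split are hypothetical and not part of Flink; the sketch only computes the two numbers and does not show how they would be applied to the outputs. It also assumes each output should keep at least one task, since the ratio is only an expectation.

```java
public final class ParallelismSplit {

    private ParallelismSplit() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * Divides a total parallelism between outputs A and B.
     *
     * @param totalParallelism parallelism set on the wrapping format, at least 2
     * @param shareOfA         expected fraction of records routed to A, in [0, 1]
     * @return a two-element array: {parallelism for A, parallelism for B}
     */
    public static int[] split(int totalParallelism, double shareOfA) {
        if (totalParallelism < 2) {
            throw new IllegalArgumentException("need at least 2 tasks to split");
        }
        if (Double.isNaN(shareOfA) || shareOfA < 0.0 || shareOfA > 1.0) {
            throw new IllegalArgumentException("shareOfA must be in [0, 1]");
        }
        // Round to the nearest whole number of tasks for A.
        int a = (int) Math.round(totalParallelism * shareOfA);
        // Assumption: keep at least one task per output, because the ratio
        // is only an estimate and either output may still receive records.
        a = Math.max(1, Math.min(totalParallelism - 1, a));
        return new int[] {a, totalParallelism - a};
    }

    public static void main(String[] args) {
        int[] p = split(10, 0.7);
        System.out.println("A=" + p[0] + ", B=" + p[1]); // prints "A=7, B=3"
    }
}
```

With a ratio of 100% to A, this sketch still reserves one task for B, so at most one empty file would be produced there instead of ten.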

I’m curious because the current approach can lead to lots of redundant empty files, and I’d like to minimize that if possible. Is something like this supported?

____________

Andreas Hailu

Data Lake Engineering | Goldman Sachs & Co.
