Hi,
let's assume I have a dataset that, depending on the input data and the filter operations applied, can end up empty. I want to write the dataset to disk, but files should only be created if the dataset is not empty. If the dataset is empty, I don't want any files.

The default way, dataset.write(...), always creates as many files as the configured parallelism of this operator, so for an empty dataset all of those files would be empty as well. I thought about doing something like:

    if (dataset.count() > 0) {
        dataset.write(...)
    }

but I don't think that's the way to go, because dataset.count() triggers an execution of the (sub)program.

Is there a simple way to avoid creating empty files for empty datasets?

Regards,
Lars
Hello Lars,
The only other way I can think of is to wrap the output format you use in a custom format that calls open() on the wrapped output format only when it receives the first record.

This should work, but it is quite hacky, as it interferes with the format life-cycle.

Regards,
Chesnay

On 08.12.2016 16:39, [hidden email] wrote:
> Is there a simple way how to avoid creating empty files for empty
> datasets?
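For illustration, the wrapping idea could look roughly like the sketch below. It is not tied to the real Flink classes: SimpleOutputFormat is a hypothetical stand-in that only mirrors the open / writeRecord / close life-cycle, and LazyOpenOutputFormat and RecordingFormat are made-up names. A real version would have to implement Flink's own OutputFormat interface (including configure) and would need testing against the actual life-cycle.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an output format life-cycle (NOT the Flink interface).
interface SimpleOutputFormat<T> {
    void open(int taskNumber, int numTasks) throws IOException;
    void writeRecord(T record) throws IOException;
    void close() throws IOException;
}

// Wrapper that defers open() on the wrapped format until the first record arrives.
class LazyOpenOutputFormat<T> implements SimpleOutputFormat<T> {
    private final SimpleOutputFormat<T> wrapped;
    private int taskNumber;
    private int numTasks;
    private boolean opened = false;

    LazyOpenOutputFormat(SimpleOutputFormat<T> wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public void open(int taskNumber, int numTasks) {
        // Only remember the arguments; the wrapped format is not opened yet,
        // so it has no chance to create a file at this point.
        this.taskNumber = taskNumber;
        this.numTasks = numTasks;
    }

    @Override
    public void writeRecord(T record) throws IOException {
        if (!opened) {
            // First record: now the wrapped format is really opened.
            wrapped.open(taskNumber, numTasks);
            opened = true;
        }
        wrapped.writeRecord(record);
    }

    @Override
    public void close() throws IOException {
        // Close the wrapped format only if it was ever opened.
        if (opened) {
            wrapped.close();
            opened = false;
        }
    }
}

// Small recording format used only to observe which calls reach the wrapped format.
class RecordingFormat implements SimpleOutputFormat<String> {
    final List<String> events = new ArrayList<>();

    @Override
    public void open(int taskNumber, int numTasks) {
        events.add("open");
    }

    @Override
    public void writeRecord(String record) {
        events.add("write:" + record);
    }

    @Override
    public void close() {
        events.add("close");
    }
}
```

With this sketch, a parallel instance that never receives a record never calls open() on the wrapped format, which is the step where a file-based format would normally create its file. Whether skipping open() and close() like this is safe for a particular format is exactly the life-cycle concern mentioned above.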
Hi Chesnay,
I actually thought about the same, but like you said, it seems a bit hacky ;-). Thank you anyway!

Regards,
Lars

On 08.12.2016 16:47, Chesnay Schepler wrote:
> The only other way i can think of how this could be done is by wrapping
> the used outputformat in a custom format, which calls open on the wrapped
> outputformat when you receive the first record.