Batch compressed file output

Flavio Pompermaier
Hello guys,
I need to write my batch data (DataSet<Row>) out to files. Specifically, I need to:
  1. split the output into multiple files when it exceeds some size threshold (by line count or max MB)
  2. compress the output data (if possible, without converting to the Hadoop format)
Are there any suggestions or recommendations for this?

Best,
Flavio
Re: Batch compressed file output

Matthias
Hi Flavio,
Others might have better ideas for solving this, but I'll give it a try: have you considered extending FileOutputFormat to achieve what you need? That approach (which is discussed in [1]) sounds like something you could do.
Another pointer is the DefaultRollingPolicy [2]. It looks like it does part of what you're looking for. I'm adding Kostas to this conversation, as he worked on the RollingPolicy. Maybe he has some more insights.
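
To make the splitting and compression part concrete, here is a minimal plain-JDK sketch of the logic a custom output format could delegate to. It is not Flink API: the class name RollingGzipWriter, its constructor parameters, and the part-file naming are all hypothetical, and the size threshold is measured on uncompressed UTF-8 bytes, which is an assumption about what "max MB" should mean.

```java
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/**
 * Hypothetical helper (not part of Flink): writes text lines into gzip part
 * files and starts a new part when a line-count or byte threshold is reached.
 */
class RollingGzipWriter implements Closeable {
    private final Path dir;
    private final String prefix;
    private final long maxLines;   // max lines per part file
    private final long maxBytes;   // max UNCOMPRESSED bytes per part file
    private final List<Path> parts = new ArrayList<>();

    private BufferedWriter current; // null when no part file is open
    private long linesInPart;
    private long bytesInPart;

    RollingGzipWriter(Path dir, String prefix, long maxLines, long maxBytes) {
        this.dir = dir;
        this.prefix = prefix;
        this.maxLines = maxLines;
        this.maxBytes = maxBytes;
    }

    void writeLine(String line) throws IOException {
        // Size of this record once written, including the trailing newline.
        long size = line.getBytes(StandardCharsets.UTF_8).length + 1;
        // Roll over before writing if the open part is full.
        if (current != null
                && (linesInPart >= maxLines || bytesInPart + size > maxBytes)) {
            closeCurrent();
        }
        if (current == null) {
            openNext();
        }
        current.write(line);
        current.write('\n');
        linesInPart++;
        bytesInPart += size;
    }

    private void openNext() throws IOException {
        // Part files are named <prefix>-<index>.gz inside the target directory.
        Path part = dir.resolve(prefix + "-" + parts.size() + ".gz");
        current = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(part)),
                StandardCharsets.UTF_8));
        parts.add(part);
        linesInPart = 0;
        bytesInPart = 0;
    }

    private void closeCurrent() throws IOException {
        current.close(); // also finishes the gzip stream
        current = null;
    }

    /** Part files created so far, in creation order. */
    List<Path> parts() {
        return parts;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            closeCurrent();
        }
    }
}
```

In a custom output format, one such writer per parallel task could presumably be created when the format is opened, fed from writeRecord, and closed when the format is closed. A record larger than the byte threshold simply ends up alone in its own part file.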

I hope that helps.

Best,
Matthias


