Compress DataSink Output


Wesley Kerr
Hello - 

Forgive me if this has been asked before, but I'm trying to determine the best way to add compression to DataSink Outputs (starting with TextOutputFormat).  Realistically I would like each partition file (based on parallelism) to be compressed independently with gzip, but am open to other solutions.

My first thought was to extend TextOutputFormat with a new class that compresses after closing and before returning, but I'm not sure that would work across all possible file systems (S3, Local, and HDFS).
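
A minimal sketch of an alternative to compressing after close: wrap each partition's output stream in a GZIPOutputStream so the data is compressed as it is written. It uses only the Java standard library on local files and is not Flink's API; the class GzipPartitionWriter, its methods, and the part-N.gz naming are hypothetical. The assumption is that any target file system (S3, Local, HDFS) can hand back a plain OutputStream to wrap in the same way.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of Flink: one independent gzip file per partition. */
final class GzipPartitionWriter {

    /** Writes one partition's records as gzip-compressed text, one record per line. */
    static Path writePartition(Path dir, int partition, List<String> records) throws IOException {
        Path file = dir.resolve("part-" + partition + ".gz"); // hypothetical naming scheme
        // Compress while writing, so no second pass over the file is needed after close.
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String record : records) {
                out.write(record);
                out.write('\n');
            }
        } // closing the writer finishes the gzip stream and closes the file
        return file;
    }

    /** Reads a partition file back, to check that it is a valid standalone gzip file. */
    static List<String> readPartition(Path file) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sink");
        // Two partitions stand in for a sink running with parallelism 2.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"));
        for (int i = 0; i < partitions.size(); i++) {
            Path file = writePartition(dir, i + 1, partitions.get(i));
            System.out.println(file.getFileName() + " -> " + readPartition(file));
        }
    }
}
```

Because each file is a complete gzip stream, any partition can be decompressed on its own with standard tools.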

Any thoughts?

Thanks!

Wes


Re: Compress DataSink Output

rmetzger0
Hi Wes,

Flink's own OutputFormats don't support compression, but we have some tools to use Hadoop's OutputFormats with Flink [1], and those support compression: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html

Let me know if you need more information.

Regards,
Robert



(In reply to Wesley Kerr's message of Thu, Aug 18, 2016 at 2:13 AM.)



Re: Compress DataSink Output

Wesley Kerr
That looks good.  Thanks!

(In reply to Robert Metzger's message of Fri, Aug 19, 2016 at 6:15 AM.)