(DEPRECATED) Apache Flink User Mailing List archive.

Compress DataSink Output

Wesley Kerr

Compress DataSink Output

Hello -

Forgive me if this has been asked before, but I'm trying to determine the best way to add compression to DataSink Outputs (starting with TextOutputFormat). Realistically I would like each partition file (based on parallelism) to be compressed independently with gzip, but am open to other solutions.

My first thought was to extend TextOutputFormat with a new class that compresses after closing and before returning, but I'm not sure that would work across all possible file systems (S3, Local, and HDFS).

Any thoughts?

Thanks!

Wes
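
Editor's note: one variant of the idea in the question is to compress while writing rather than after closing. The sketch below uses only the JDK (java.util.zip.GZIPOutputStream) and is not Flink API. The class name GzipLineWriter and its methods are hypothetical, and wiring this into a Flink OutputFormat is assumed rather than shown. Because it only wraps an OutputStream, the same approach should apply to any file system that hands back a stream, and each partition would get its own independent gzip file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of Flink: gzip records while they are written. */
public class GzipLineWriter {

    /** Writes each record as one line into a gzip stream layered over raw. */
    public static void writeLines(OutputStream raw, List<String> records) throws IOException {
        // Closing the writer finishes the gzip trailer and also closes raw.
        try (Writer out = new OutputStreamWriter(new GZIPOutputStream(raw), StandardCharsets.UTF_8)) {
            for (String record : records) {
                out.write(record);
                out.write('\n');
            }
        }
    }

    /** Decompresses gzip bytes back to text; used here only to check the round trip. */
    public static String readAll(byte[] gzipped) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stand-in for one partition file; a real sink would
        // obtain the stream from the target file system instead.
        ByteArrayOutputStream partition = new ByteArrayOutputStream();
        writeLines(partition, Arrays.asList("a,1", "b,2"));
        System.out.print(readAll(partition.toByteArray()));
    }
}
```

With parallelism n, each parallel writer would call something like writeLines on its own stream, giving n separate .gz files that can each be decompressed on their own.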

rmetzger0

Re: Compress DataSink Output

Hi Wes,

Flink's own OutputFormats don't support compression, but we have some tools to use Hadoop's OutputFormats with Flink [1], and those support compression: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html

Let me know if you need more information.

Regards,

Robert

[1]: https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/hadoop_compatibility.html

(In reply to Wesley Kerr's message of Thu, Aug 18, 2016 at 2:13 AM.)

Wesley Kerr

Re: Compress DataSink Output

That looks good. Thanks!

(In reply to Robert Metzger's message of Fri, Aug 19, 2016 at 6:15 AM.)