File Naming Pattern from HadoopOutputFormat

File Naming Pattern from HadoopOutputFormat

Hailu, Andreas

Hello Flink team,

 

I’m writing Avro and Parquet files to HDFS, and I would like to include a UUID as part of the file name.

 

Our files in HDFS currently follow this pattern:

 

tmp-r-00001.snappy.parquet

tmp-r-00002.snappy.parquet

...

 

I’m using a custom output format which extends RichOutputFormat - is this something that is natively supported? If so, could you please recommend how it could be done, or share the relevant documentation?

 

Best,

Andreas




Re: File Naming Pattern from HadoopOutputFormat

Haibo Sun
Hi, Andreas

I think the following things may be what you want.

1. For writing Avro, I think you can extend AvroOutputFormat and override the getDirectoryFileName() method to customize the file name, as shown below.

	public static class CustomAvroOutputFormat<E> extends AvroOutputFormat<E> {
		public CustomAvroOutputFormat(Path filePath, Class<E> type) {
			super(filePath, type);
		}

		public CustomAvroOutputFormat(Class<E> type) {
			super(type);
		}

		@Override
		public void open(int taskNumber, int numTasks) throws IOException {
			// always write into the output directory so the custom file name is used
			this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
			super.open(taskNumber, numTasks);
		}

		@Override
		protected String getDirectoryFileName(int taskNumber) {
			// return a custom file name, e.g. one that embeds a UUID (java.util.UUID)
			return taskNumber + "-" + UUID.randomUUID() + ".avro";
		}
	}
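
For reference, here is a rough sketch of how such a subclass could be wired into a batch job (the Event POJO, the output path and the job name below are made-up placeholders; adjust them to your setup):

	import org.apache.flink.api.java.DataSet;
	import org.apache.flink.api.java.ExecutionEnvironment;
	import org.apache.flink.core.fs.Path;

	public class AvroWriteJob {

		// Simple POJO written through Avro's reflect support; a stand-in for your record type.
		public static class Event {
			public String id;
			public long timestamp;

			public Event() {}

			public Event(String id, long timestamp) {
				this.id = id;
				this.timestamp = timestamp;
			}
		}

		public static void main(String[] args) throws Exception {
			ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

			DataSet<Event> events = env.fromElements(new Event("a", 1L), new Event("b", 2L));

			// CustomAvroOutputFormat is the subclass sketched above.
			events.output(new CustomAvroOutputFormat<Event>(new Path("hdfs:///tmp/events"), Event.class));

			env.execute("Write Avro with custom file names");
		}
	}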

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a class that implements the BucketAssigner interface and return a custom file name from its getBucketId() method (the returned value will be treated as the file name).
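
For illustration, a minimal sketch of a custom BucketAssigner (the Event type and its day field are hypothetical; the value returned by getBucketId() is appended to the sink's base path):

	import org.apache.flink.core.io.SimpleVersionedSerializer;
	import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
	import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

	// Hypothetical record type carrying a "day" string such as "20190101".
	class Event {
		String day;
	}

	public class DayBucketAssigner implements BucketAssigner<Event, String> {

		@Override
		public String getBucketId(Event element, BucketAssigner.Context context) {
			// The returned string becomes the bucket under the sink's base path,
			// e.g. <base path>/day=20190101/...
			return "day=" + element.day;
		}

		@Override
		public SimpleVersionedSerializer<String> getSerializer() {
			return SimpleVersionedStringSerializer.INSTANCE;
		}
	}

It would then be plugged in via StreamingFileSink.forBulkFormat(...).withBucketAssigner(new DayBucketAssigner()).build().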





Best,
Haibo

Re: File Naming Pattern from HadoopOutputFormat

Yitzchak Lieberman
Regarding option 2 for Parquet:
Implementing a BucketAssigner won't set the file name, as getBucketId() defines the directory for the files when partitioning the data, for example:
<root dir>/day=20190101/part-1-1
There is an open issue for that: https://issues.apache.org/jira/browse/FLINK-12573

Re: Re: File Naming Pattern from HadoopOutputFormat

Haibo Sun

Hi, Andreas 

You are right. To meet this requirement, Flink would need to expose an interface that allows customizing the file name.
 
Best,
Haibo

RE: Re: Re: File Naming Pattern from HadoopOutputFormat

Hailu, Andreas

Hi Haibo, Yitzchak, thanks for getting back to me.

 

The pattern that worked for me was to extend the HadoopOutputFormat class, override the open() method, and modify the “mapreduce.output.basename” configuration property to match my desired file naming structure.
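
A rough sketch of that pattern (UuidBasenameOutputFormat and the "data-" prefix are placeholder names; K and V are the key/value types of the wrapped Hadoop OutputFormat):

	import java.io.IOException;
	import java.util.UUID;

	import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
	import org.apache.hadoop.mapreduce.Job;
	import org.apache.hadoop.mapreduce.OutputFormat;

	public class UuidBasenameOutputFormat<K, V> extends HadoopOutputFormat<K, V> {

		public UuidBasenameOutputFormat(OutputFormat<K, V> mapreduceOutputFormat, Job job) throws IOException {
			super(mapreduceOutputFormat, job);
		}

		@Override
		public void open(int taskNumber, int numTasks) throws IOException {
			// Hadoop's FileOutputFormat derives part-file names from this property,
			// producing <basename>-r-<nnnnn>.<ext> instead of tmp-r-<nnnnn>.<ext>.
			// Note: depending on the Flink version, super.open() may overwrite this value; see below.
			getConfiguration().set("mapreduce.output.basename", "data-" + UUID.randomUUID());
			super.open(taskNumber, numTasks);
		}
	}

One caveat: some versions of HadoopOutputFormatBase#open set mapreduce.output.basename to "tmp" themselves before creating the record writer (which is where the tmp- prefix comes from), in which case the property needs to be changed inside a fully overridden open() adapted from the base implementation, rather than set before delegating to super.open().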

 

// ah

 

Re: RE: Re: Re: File Naming Pattern from HadoopOutputFormat

Haibo Sun
Hi, Andreas  

I'm glad you found a solution. If you're interested in option 2 that I mentioned, you can follow the progress of the issue Yitzchak pointed out (https://issues.apache.org/jira/browse/FLINK-12573) by watching it.

Best,
Haibo

RE: Re: RE: Re: Re: File Naming Pattern from HadoopOutputFormat

Hailu, Andreas

Very well – thank you both.

 

// ah

 
