File Naming Pattern from HadoopOutputFormat

File Naming Pattern from HadoopOutputFormat

Hailu, Andreas

Hello Flink team,

 

I’m writing Avro and Parquet files to HDFS, and I would like to include a UUID as part of the file name.

 

Our files in HDFS currently follow this pattern:

 

tmp-r-00001.snappy.parquet

tmp-r-00002.snappy.parquet

...

 

I’m using a custom output format which extends RichOutputFormat - is this something that is natively supported? If so, could you please recommend how it could be done, or share the relevant documentation?

 

Best,

Andreas




Re: File Naming Pattern from HadoopOutputFormat

Haibo Sun
Hi, Andreas

I think the following things may be what you want.

1. For writing Avro, I think you can extend AvroOutputFormat and override the getDirectoryFileName() method to customize the file name, as shown below.

	public static class CustomAvroOutputFormat<E> extends AvroOutputFormat<E> {
		public CustomAvroOutputFormat(Path filePath, Class<E> type) {
			super(filePath, type);
		}

		public CustomAvroOutputFormat(Class<E> type) {
			super(type);
		}

		@Override
		public void open(int taskNumber, int numTasks) throws IOException {
			// always write into the output directory so the custom file name is used
			this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
			super.open(taskNumber, numTasks);
		}

		@Override
		protected String getDirectoryFileName(int taskNumber) {
			// return a custom file name, e.g. one that embeds a UUID (java.util.UUID)
			return taskNumber + "-" + UUID.randomUUID() + ".avro";
		}
	}
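
For reference, here is a rough sketch of how such a subclass could be wired into a batch job (the Event POJO, the output path and the job name below are made-up placeholders; adjust them to your setup):

	import org.apache.flink.api.java.DataSet;
	import org.apache.flink.api.java.ExecutionEnvironment;
	import org.apache.flink.core.fs.Path;

	public class AvroWriteJob {

		// Simple POJO written through Avro's reflect support; a stand-in for your record type.
		public static class Event {
			public String id;
			public long timestamp;

			public Event() {}

			public Event(String id, long timestamp) {
				this.id = id;
				this.timestamp = timestamp;
			}
		}

		public static void main(String[] args) throws Exception {
			ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

			DataSet<Event> events = env.fromElements(new Event("a", 1L), new Event("b", 2L));

			// CustomAvroOutputFormat is the subclass sketched above.
			events.output(new CustomAvroOutputFormat<Event>(new Path("hdfs:///tmp/events"), Event.class));

			env.execute("Write Avro with custom file names");
		}
	}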

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a class that implements the BucketAssigner interface and return a custom file name from its getBucketId() method (the returned value will be treated as the file name).
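
For illustration, a minimal sketch of a custom BucketAssigner (the Event type and its day field are hypothetical; the value returned by getBucketId() is appended to the sink's base path):

	import org.apache.flink.core.io.SimpleVersionedSerializer;
	import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
	import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

	// Hypothetical record type carrying a "day" string such as "20190101".
	class Event {
		String day;
	}

	public class DayBucketAssigner implements BucketAssigner<Event, String> {

		@Override
		public String getBucketId(Event element, BucketAssigner.Context context) {
			// The returned string becomes the bucket under the sink's base path,
			// e.g. <base path>/day=20190101/...
			return "day=" + element.day;
		}

		@Override
		public SimpleVersionedSerializer<String> getSerializer() {
			return SimpleVersionedStringSerializer.INSTANCE;
		}
	}

It would then be plugged in via StreamingFileSink.forBulkFormat(...).withBucketAssigner(new DayBucketAssigner()).build().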





Best,
Haibo

Re: File Naming Pattern from HadoopOutputFormat

Yitzchak Lieberman
Regarding option 2 for Parquet:
Implementing a BucketAssigner won't set the file name, as getBucketId() defines the directory for the files when partitioning the data, for example:
<root dir>/day=20190101/part-1-1
There is an open issue for that: https://issues.apache.org/jira/browse/FLINK-12573

Re: Re: File Naming Pattern from HadoopOutputFormat

Haibo Sun

Hi, Andreas 

You are right. To meet this requirement, Flink would need to expose an interface that allows customizing the file name.
 
Best,
Haibo

RE: Re: Re: File Naming Pattern from HadoopOutputFormat

Hailu, Andreas

Hi Haibo, Yitzchak, thanks for getting back to me.

 

The pattern that worked for me was to extend the HadoopOutputFormat class, override the open() method, and modify the “mapreduce.output.basename” configuration property to match my desired file naming structure.
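
A rough sketch of that pattern (UuidBasenameOutputFormat and the "data-" prefix are placeholder names; K and V are the key/value types of the wrapped Hadoop OutputFormat):

	import java.io.IOException;
	import java.util.UUID;

	import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
	import org.apache.hadoop.mapreduce.Job;
	import org.apache.hadoop.mapreduce.OutputFormat;

	public class UuidBasenameOutputFormat<K, V> extends HadoopOutputFormat<K, V> {

		public UuidBasenameOutputFormat(OutputFormat<K, V> mapreduceOutputFormat, Job job) throws IOException {
			super(mapreduceOutputFormat, job);
		}

		@Override
		public void open(int taskNumber, int numTasks) throws IOException {
			// Hadoop's FileOutputFormat derives part-file names from this property,
			// producing <basename>-r-<nnnnn>.<ext> instead of tmp-r-<nnnnn>.<ext>.
			// Note: depending on the Flink version, super.open() may overwrite this value; see below.
			getConfiguration().set("mapreduce.output.basename", "data-" + UUID.randomUUID());
			super.open(taskNumber, numTasks);
		}
	}

One caveat: some versions of HadoopOutputFormatBase#open set mapreduce.output.basename to "tmp" themselves before creating the record writer (which is where the tmp- prefix comes from), in which case the property needs to be changed inside a fully overridden open() adapted from the base implementation, rather than set before delegating to super.open().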

 

// ah

 

Re: RE: Re: Re: File Naming Pattern from HadoopOutputFormat

Haibo Sun
Hi, Andreas  

I'm glad you found a solution. If you're interested in option 2 that I mentioned, you can follow the progress of the issue Yitzchak pointed out (https://issues.apache.org/jira/browse/FLINK-12573) by watching it.

Best,
Haibo

RE: Re: RE: Re: Re: File Naming Pattern from HadoopOutputFormat

Hailu, Andreas

Very well – thank you both.

 

// ah

 
