CSV sink partitioning and bucketing

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

CSV sink partitioning and bucketing

Flavio Pompermaier
Hi to all,
in my use case I'd need to output my Row objects into an output folder as CSV on HDFS but creating/overwriting new subfolders based on an attribute (for example create a subfolder for each value of a specified column). Then, it could be interesting to bucketing the data inside those folders by number of lines,i.e. every file inside those directory cannot contain more than 1000 lines.

For example, if I have a dataset (of Row) containing people I need to write my dataset as CSV into an output folder X  partitioned by year (where each file cannot have more then 1000 rows), like:

X/1990/file1
   /1990/file2
   /1991/file1
etc..

Does something like that exists in Flink?
In principle I could use Hive for this but at the moment I'd try to avoid to add another component to our pipeline...Moreover, my feeling is that very few people is using Flink on Hive..am I wrong?
Any advice on how to proceed?

Best,
Flavio

Reply | Threaded
Open this post in threaded view
|

Re: CSV sink partitioning and bucketing

Fabian Hueske-2
Hi Flavio,

Flink does not come with an OutputFormat that creates buckets. It should not be too hard to implement this in Flink though.

However, if you want a solution fast, I would try the following approach:
1) Search for a Hadoop OutputFormat that buckets Strings based on a key (<Key, String>).
2) Implement a mapper that converts Row into a String and extracts the key
3) Use the Hadoop OutputFormat with Flink's HadoopOutputFormat wrapper.

Depending on the output format you might want to partition and sort the data on the key before writing it out.

Best, Fabian

2017-02-17 9:32 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi to all,
in my use case I'd need to output my Row objects into an output folder as CSV on HDFS but creating/overwriting new subfolders based on an attribute (for example create a subfolder for each value of a specified column). Then, it could be interesting to bucketing the data inside those folders by number of lines,i.e. every file inside those directory cannot contain more than 1000 lines.

For example, if I have a dataset (of Row) containing people I need to write my dataset as CSV into an output folder X  partitioned by year (where each file cannot have more then 1000 rows), like:

X/1990/file1
   /1990/file2
   /1991/file1
etc..

Does something like that exists in Flink?
In principle I could use Hive for this but at the moment I'd try to avoid to add another component to our pipeline...Moreover, my feeling is that very few people is using Flink on Hive..am I wrong?
Any advice on how to proceed?

Best,
Flavio