Re: BucketingSink capabilities for DataSet API
Posted by
anuj.aj07 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/BucketingSink-capabilities-for-DataSet-API-tp24107p32830.html
Hi Rafi,
I have a similar use case: I want to read Parquet files with the DataSet API, apply some transformations, and write the result partitioned by year, month, and day.
I am stuck at the first step, which is how to read and write Parquet files using hadoop-compatibility.
Please help me with this, and also let me know if you found a solution for writing the data in partitions.
Thanks,
Anuj
On Thu, Oct 25, 2018 at 5:35 PM Andrey Zagrebin <[hidden email]> wrote:
Hi Rafi,
At the moment I do not see any support for Parquet in the DataSet API except HadoopOutputFormat, which is mentioned in the Stack Overflow question. I have cc’ed Fabian and Aljoscha; maybe they can provide more information.
Best,
Andrey
Hi,
I'm writing a Batch job which reads Parquet, does some aggregations and writes back as Parquet files.
I would like the output to be partitioned by year, month, and day based on event time, similar to the functionality of the BucketingSink.
I was able to achieve the reading/writing to/from Parquet by using the hadoop-compatibility features.
I couldn't find a way to partition the data by year, month, day to create a folder hierarchy accordingly. Everything is written to a single directory.
Can anyone suggest a way to achieve this? Maybe there's a way to integrate the BucketingSink with the DataSet API? Or is there another solution?
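To make the desired layout concrete, here is a minimal sketch of the bucketing step only, written in plain Java with no Flink or Parquet dependency. It assumes the event time is available as epoch milliseconds, is interpreted in UTC, and that Hive-style folder names (`year=…/month=…/day=…`) are acceptable. The class and method names (`EventTimeBuckets`, `bucketPath`, `groupByBucket`) are hypothetical and not part of any Flink API. It does not write any files; it only shows how each record could be mapped to its target folder.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Flink: maps event times to folder names.
final class EventTimeBuckets {

    // Builds a relative folder such as "year=2018/month=10/day=25" from an
    // event timestamp in epoch milliseconds. UTC is an assumption here.
    static String bucketPath(long eventTimeMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(eventTimeMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%04d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    // Groups event timestamps by their bucket folder, sorted by folder name.
    // In a real job each group would go to its own output directory.
    static Map<String, List<Long>> groupByBucket(List<Long> eventTimes) {
        Map<String, List<Long>> buckets = new TreeMap<>();
        for (long ts : eventTimes) {
            buckets.computeIfAbsent(bucketPath(ts), k -> new ArrayList<>()).add(ts);
        }
        return buckets;
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2018-10-25T17:35:00Z").toEpochMilli();
        System.out.println(bucketPath(ts));
    }
}
```

In a batch job, the same function could serve as the grouping key, so that each group is then written to a base path plus its bucket folder. How to attach a separate output per group in the DataSet API is exactly the open question above, so this sketch deliberately stops before that step.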