Hi, I am using the Flink 1.8.1 DataSet API for batch processing. The data source is Avro files and I want to output the result as Parquet. https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/batch/ has no related information. What's the recommended way of doing this? Do I need to write adapters? Appreciate your help!
Flink allows using Hadoop (MapReduce) OutputFormats in Flink jobs [1], so you can try Parquet's OutputFormat [2]. And if you can switch to the DataStream API, StreamingFileSink + ParquetBulkWriter meets your requirement [3][4]. A sketch of the Hadoop OutputFormat route follows below.

Best Regards,
Zhenghua Gao

On Fri, Aug 16, 2019 at 1:04 PM Lian Jiang <[hidden email]> wrote:
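A minimal, untested sketch of that route, assuming the records are Avro GenericRecords: the paths and the sample schema are placeholders, it needs flink-avro, flink-hadoop-compatibility, and parquet-avro on the classpath, and note that AvroParquetOutputFormat is only generic in recent parquet-avro versions.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class AvroToParquetBatchJob {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the Avro files as GenericRecords ("hdfs:///in/avro" is a placeholder path).
        DataSet<GenericRecord> records = env.createInput(new AvroInputFormat<>(
                new org.apache.flink.core.fs.Path("hdfs:///in/avro"), GenericRecord.class));

        // Replace with your records' actual schema (a tiny stand-in is shown here).
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
                        + "[{\"name\":\"id\",\"type\":\"long\"}]}");

        // Wrap Parquet's Hadoop (mapreduce) OutputFormat so Flink can use it as a sink.
        Job job = Job.getInstance();
        HadoopOutputFormat<Void, GenericRecord> parquetOut =
                new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job);
        AvroParquetOutputFormat.setSchema(job, schema);
        FileOutputFormat.setOutputPath(job, new org.apache.hadoop.fs.Path("hdfs:///out/parquet"));

        // Hadoop OutputFormats consume key/value pairs; Parquet ignores the key.
        records.map(new MapFunction<GenericRecord, Tuple2<Void, GenericRecord>>() {
                    @Override
                    public Tuple2<Void, GenericRecord> map(GenericRecord record) {
                        return Tuple2.of(null, record);
                    }
                })
                .output(parquetOut);

        env.execute("avro-to-parquet");
    }
}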
Thanks. Which API (DataSet or DataStream) is recommended for file handling when no window operations are required?
We have a similar scenario for real-time processing. Might it make sense to use the DataStream API for both batch and real-time, for uniformity?
Sent from my iPhone
The DataStream API should fully subsume the DataSet API (through bounded streams) in the long run [1]. You can also consider using the Table/SQL API in your project. A sketch of the StreamingFileSink setup follows below.

[1] https://flink.apache.org/roadmap.html#analytics-applications-and-the-roles-of-datastream-dataset-and-table-api

Best Regards,
Zhenghua Gao

On Fri, Aug 16, 2019 at 11:52 PM Lian Jiang <[hidden email]> wrote:
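For the DataStream route, a rough sketch of wiring Flink 1.8's StreamingFileSink with the bundled Parquet writers. It assumes flink-parquet and parquet-avro on the classpath; the output path is a placeholder, and the `records` parameter stands for whatever DataStream<GenericRecord> your job produces.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class ParquetSinkSketch {

    // Builds a bulk-encoded Parquet sink for Avro GenericRecords
    // ("hdfs:///out/parquet" is a placeholder path).
    public static StreamingFileSink<GenericRecord> parquetSink(Schema schema) {
        return StreamingFileSink
                .forBulkFormat(new Path("hdfs:///out/parquet"),
                        ParquetAvroWriters.forGenericRecord(schema))
                .build();
    }

    public static void wire(StreamExecutionEnvironment env,
                            DataStream<GenericRecord> records, Schema schema) {
        // Bulk formats roll a part file on every checkpoint, so checkpointing
        // must be enabled for part files to ever be finalized.
        env.enableCheckpointing(60_000);
        records.addSink(parquetSink(schema));
    }
}

One caveat worth checking if you use this sink for bounded input: part files are only committed on checkpoints, so files still in progress when a bounded job finishes may be left unfinalized.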