processing avro data source using DataSet API and output to parquet

4 messages
processing avro data source using DataSet API and output to parquet

Lian Jiang
Hi,

I am using the Flink 1.8.1 DataSet API for batch processing. The data source is Avro files and I want to write the result as Parquet. https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/batch/ has no related information. What's the recommended way to do this? Do I need to write adapters? Appreciate your help!


Re: processing avro data source using DataSet API and output to parquet

Zhenghua Gao
Flink allows using Hadoop (MapReduce) OutputFormats in Flink jobs [1]. You can try Parquet's OutputFormat [2].
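A minimal sketch of that approach (Flink 1.8, Java): read Avro with flink-avro's AvroInputFormat and write Parquet by wrapping parquet-avro's AvroParquetOutputFormat in Flink's HadoopOutputFormat. The paths and the schema below are placeholders; GenericRecord is used for brevity, though generated SpecificRecord classes serialize more cleanly, and whether AvroParquetOutputFormat takes a type parameter depends on your parquet-avro version.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class AvroToParquetJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read Avro files as GenericRecords (placeholder input path).
        AvroInputFormat<GenericRecord> avroInput =
                new AvroInputFormat<>(new Path("hdfs:///input/avro"), GenericRecord.class);
        DataSet<GenericRecord> records = env.createInput(avroInput);

        // Configure the Hadoop Parquet output format; the schema must match the records.
        // Example schema for illustration only — use your records' actual schema.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Example\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");
        Job job = Job.getInstance();
        AvroParquetOutputFormat.setSchema(job, schema);
        FileOutputFormat.setOutputPath(job, new org.apache.hadoop.fs.Path("hdfs:///output/parquet"));

        // Flink's HadoopOutputFormat consumes Tuple2<K, V>; Parquet ignores the key.
        HadoopOutputFormat<Void, GenericRecord> parquetOutput =
                new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job);

        records.map(new MapFunction<GenericRecord, Tuple2<Void, GenericRecord>>() {
                    @Override
                    public Tuple2<Void, GenericRecord> map(GenericRecord r) {
                        return Tuple2.of(null, r);
                    }
                })
                .output(parquetOutput);

        env.execute("avro-to-parquet");
    }
}
```

The anonymous MapFunction (rather than a lambda) is deliberate: it lets Flink extract the Tuple2 type information by reflection, which lambdas with generic tuples often defeat.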




Re: processing avro data source using DataSet API and output to parquet

Lian Jiang
Thanks. Which API (DataSet or DataStream) is recommended for file handling (no window operations required)?

We have a similar scenario for real-time processing. Would it make sense to use the DataStream API for both batch and real-time processing, for uniformity?

Sent from my iPhone



Re: processing avro data source using DataSet API and output to parquet

Zhenghua Gao
The DataStream API should fully subsume the DataSet API (through bounded streams) in the long run [1].

Best Regards,
Zhenghua Gao
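On the DataStream route, Flink 1.8's flink-parquet module already provides ParquetAvroWriters for use with the StreamingFileSink, so the same Avro-to-Parquet job can be sketched as a bounded stream. As before, the paths and schema are placeholders:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class AvroToParquetStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing is required: the StreamingFileSink finalizes part files on checkpoints.
        env.enableCheckpointing(60_000);

        // Example schema for illustration only — use your records' actual schema.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Example\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        // A bounded file source; a Kafka source would slot in the same way for real-time.
        AvroInputFormat<GenericRecord> avroInput =
                new AvroInputFormat<>(new Path("hdfs:///input/avro"), GenericRecord.class);
        DataStream<GenericRecord> records = env.createInput(avroInput);

        StreamingFileSink<GenericRecord> sink = StreamingFileSink
                .forBulkFormat(new Path("hdfs:///output/parquet"),
                               ParquetAvroWriters.forGenericRecord(schema))
                .build();
        records.addSink(sink);

        env.execute("avro-to-parquet-stream");
    }
}
```

One caveat with this sketch: since the sink only rolls part files on checkpoints, a bounded job that finishes between checkpoints can leave the tail of the data in in-progress files, which is one reason the DataSet route is still common for pure batch in 1.8.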

