Parquet batch table sink in Flink 1.11

Parquet batch table sink in Flink 1.11

Flavio Pompermaier
Hi to all,
is there a way to write out Parquet-Avro data using BatchTableEnvironment with Flink 1.11?
At the moment I'm using the Hadoop ParquetOutputFormat, but I hope to be able to get rid of it sooner or later. I saw that there's the AvroOutputFormat, but it has no support for Parquet.

Best,
Flavio

Re: Parquet batch table sink in Flink 1.11

Jingsong Li
Hi Flavio,

AvroOutputFormat only supports writing Avro files.
I think you can use `AvroParquetOutputFormat` as a Hadoop output format and wrap it with Flink's `HadoopOutputFormat`.
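
A minimal, untested sketch of that wrapping (the schema, output path, and field names below are made up for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetAvroSinkSketch {

  // Hypothetical Avro schema; any record schema works the same way.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}";

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Configure the Hadoop side: write schema and output directory.
    Job job = Job.getInstance();
    AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(SCHEMA_JSON));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/parquet-out"));

    // Wrap the Hadoop mapreduce output format in Flink's HadoopOutputFormat.
    HadoopOutputFormat<Void, GenericRecord> parquetFormat =
        new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job);

    // The Hadoop format expects (key, value) pairs; Parquet ignores the key.
    DataSet<Tuple2<Void, GenericRecord>> records = env
        .fromElements(Tuple2.of("alice", 42), Tuple2.of("bob", 7))
        .map(new ToGenericRecord());

    records.output(parquetFormat);
    env.execute("write parquet-avro");
  }

  /** Turns plain tuples into Avro GenericRecords; the schema is re-parsed in open()
   *  instead of being captured in the closure, since Schema serializability depends
   *  on the Avro version. */
  private static class ToGenericRecord
      extends RichMapFunction<Tuple2<String, Integer>, Tuple2<Void, GenericRecord>> {
    private transient Schema schema;

    @Override
    public void open(Configuration parameters) {
      schema = new Schema.Parser().parse(SCHEMA_JSON);
    }

    @Override
    public Tuple2<Void, GenericRecord> map(Tuple2<String, Integer> value) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("name", value.f0);
      record.put("age", value.f1);
      return Tuple2.of(null, record);
    }
  }
}
```

This needs flink-hadoop-compatibility (for `HadoopOutputFormat`) and parquet-avro (for `AvroParquetOutputFormat`) on the classpath.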

Best,
Jingsong


Re: Parquet batch table sink in Flink 1.11

Flavio Pompermaier
This is what I actually do, but I was hoping to get rid of the HadoopOutputFormat and use a more comfortable Source/Sink implementation.


Re: Parquet batch table sink in Flink 1.11

Jingsong Li
In Table/SQL, I think we don't need a source/sink for `AvroParquetOutputFormat`, because the data structure is always Row or RowData; it should not be an Avro object.

Best,
Jingsong


Re: Parquet batch table sink in Flink 1.11

Flavio Pompermaier
I think that's not true when you need to integrate Flink into an existing data lake. In my opinion, it should be very straightforward to read/write Parquet data containing objects serialized with Avro/Thrift/Protobuf, or at least to reuse Hadoop input/output formats with the Table API. At the moment I have to go through a lot of custom code that uses the Hadoop formats, just to read and write Thrift- or Avro-serialized objects in Parquet folders.
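
For illustration, an untested sketch of the kind of glue currently needed on the read side, assuming hypothetical schema, path, and field names: the Hadoop `AvroParquetInputFormat` is wrapped in Flink's `HadoopInputFormat`, its records are mapped to `Row`s, and the result is registered with a `BatchTableEnvironment`:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.BatchTableEnvironment;
import org.apache.flink.types.Row;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ParquetAvroToTableSketch {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

    // Hypothetical Avro read schema matching the Parquet files in the input folder.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    Job job = Job.getInstance();
    FileInputFormat.addInputPath(job, new Path("/tmp/parquet-in"));
    AvroParquetInputFormat.setAvroReadSchema(job, schema);

    // Wrap the Hadoop mapreduce input format in Flink's HadoopInputFormat.
    HadoopInputFormat<Void, GenericRecord> parquetInput = new HadoopInputFormat<>(
        new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);

    // Read (key, record) pairs and flatten each Avro record into a Row.
    DataSet<Row> rows = env.createInput(parquetInput)
        .map(t -> Row.of(t.f1.get("name").toString(), (Integer) t.f1.get("age")))
        .returns(Types.ROW(Types.STRING, Types.INT));

    // Register the DataSet as a Table and query it with the Table API / SQL.
    Table users = tEnv.fromDataSet(rows, "name, age");
    tEnv.createTemporaryView("users", users);
    Table adults = tEnv.sqlQuery("SELECT name FROM users WHERE age > 18");

    // Convert back to a DataSet to trigger execution and print the result.
    tEnv.toDataSet(adults, Row.class).print();
  }
}
```

The hand-written Avro-to-Row mapping is exactly the glue in question; a built-in Parquet source/sink for the Table API would make it unnecessary.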
