Parquet batch table sink in Flink 1.11

Parquet batch table sink in Flink 1.11

Flavio Pompermaier
Hi to all,
is there a way to write out Parquet-Avro data using BatchTableEnvironment with Flink 1.11?
At the moment I'm using the Hadoop ParquetOutputFormat, but I hope to be able to get rid of it sooner or later. I saw that there's the AvroOutputFormat, but it has no support for Parquet.

Best,
Flavio

Re: Parquet batch table sink in Flink 1.11

Jingsong Li
Hi Flavio,

AvroOutputFormat only supports writing Avro files.
I think you can use `AvroParquetOutputFormat` as a Hadoop output format and wrap it with Flink's `HadoopOutputFormat`.
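
A minimal, untested sketch of that wrapping (the schema, output path, and field names below are made up for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;

public class ParquetAvroSinkSketch {

  // Hypothetical Avro schema; any record schema works the same way.
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}";

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Configure the Hadoop side: write schema and output directory.
    Job job = Job.getInstance();
    AvroParquetOutputFormat.setSchema(job, new Schema.Parser().parse(SCHEMA_JSON));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/parquet-out"));

    // Wrap the Hadoop mapreduce output format in Flink's HadoopOutputFormat.
    HadoopOutputFormat<Void, GenericRecord> parquetFormat =
        new HadoopOutputFormat<>(new AvroParquetOutputFormat<GenericRecord>(), job);

    // The Hadoop format expects (key, value) pairs; Parquet ignores the key.
    DataSet<Tuple2<Void, GenericRecord>> records = env
        .fromElements(Tuple2.of("alice", 42), Tuple2.of("bob", 7))
        .map(new ToGenericRecord());

    records.output(parquetFormat);
    env.execute("write parquet-avro");
  }

  /** Turns plain tuples into Avro GenericRecords; the schema is re-parsed in open()
   *  instead of being captured in the closure, since Schema serializability depends
   *  on the Avro version. */
  private static class ToGenericRecord
      extends RichMapFunction<Tuple2<String, Integer>, Tuple2<Void, GenericRecord>> {
    private transient Schema schema;

    @Override
    public void open(Configuration parameters) {
      schema = new Schema.Parser().parse(SCHEMA_JSON);
    }

    @Override
    public Tuple2<Void, GenericRecord> map(Tuple2<String, Integer> value) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("name", value.f0);
      record.put("age", value.f1);
      return Tuple2.of(null, record);
    }
  }
}
```

This needs flink-hadoop-compatibility (for `HadoopOutputFormat`) and parquet-avro (for `AvroParquetOutputFormat`) on the classpath.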

Best,
Jingsong


Re: Parquet batch table sink in Flink 1.11

Flavio Pompermaier
This is what I actually do, but I was hoping to get rid of the HadoopOutputFormat and use a more comfortable Source/Sink implementation.


Re: Parquet batch table sink in Flink 1.11

Jingsong Li
In Table/SQL, I think we don't need a source/sink for `AvroParquetOutputFormat`, because the data structure is always Row or RowData; it should not be an Avro object.

Best,
Jingsong


Re: Parquet batch table sink in Flink 1.11

Flavio Pompermaier
I think that's not true when you need to integrate Flink into an existing data lake. In my opinion, it should be very straightforward to read/write Parquet data containing objects serialized with Avro/Thrift/Protobuf, or at least to reuse Hadoop input/output formats with the Table API. At the moment I have to go through a lot of custom code that uses the Hadoop formats, just to read and write Thrift- or Avro-serialized objects in Parquet folders.
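
For illustration, an untested sketch of the kind of glue currently needed on the read side, assuming hypothetical schema, path, and field names: the Hadoop `AvroParquetInputFormat` is wrapped in Flink's `HadoopInputFormat`, its records are mapped to `Row`s, and the result is registered with a `BatchTableEnvironment`:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.BatchTableEnvironment;
import org.apache.flink.types.Row;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ParquetAvroToTableSketch {

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

    // Hypothetical Avro read schema matching the Parquet files in the input folder.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    Job job = Job.getInstance();
    FileInputFormat.addInputPath(job, new Path("/tmp/parquet-in"));
    AvroParquetInputFormat.setAvroReadSchema(job, schema);

    // Wrap the Hadoop mapreduce input format in Flink's HadoopInputFormat.
    HadoopInputFormat<Void, GenericRecord> parquetInput = new HadoopInputFormat<>(
        new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);

    // Read (key, record) pairs and flatten each Avro record into a Row.
    DataSet<Row> rows = env.createInput(parquetInput)
        .map(t -> Row.of(t.f1.get("name").toString(), (Integer) t.f1.get("age")))
        .returns(Types.ROW(Types.STRING, Types.INT));

    // Register the DataSet as a Table and query it with the Table API / SQL.
    Table users = tEnv.fromDataSet(rows, "name, age");
    tEnv.createTemporaryView("users", users);
    Table adults = tEnv.sqlQuery("SELECT name FROM users WHERE age > 18");

    // Convert back to a DataSet to trigger execution and print the result.
    tEnv.toDataSet(adults, Row.class).print();
  }
}
```

The hand-written Avro-to-Row mapping is exactly the glue in question; a built-in Parquet source/sink for the Table API would make it unnecessary.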
