Hi all,
I'd like to know whether Flink is able to exploit the Parquet format to read data efficiently from HDFS. Is there any example available?

Best,
Flavio
Hi Flavio,

I am not aware of a Flink InputFormat for Parquet. However, it should hopefully be covered by the Hadoop IF wrapper.

2014-11-11 12:10 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Maybe this is a dumb question, but could you explain to me the benefits of a dedicated Flink IF vs. the one available by default through the Hadoop IF wrapper? Is it just because of data locality of task slots?

On Tue, Nov 11, 2014 at 12:16 PM, Fabian Hueske <[hidden email]> wrote:
First of all, split locality can make a huge difference. A dedicated IF also enables tighter integration, both API-wise and at execution time, for example by pushing filters or projections directly into the data source and thereby reducing the data that has to be read from the file system.

2014-11-11 12:30 GMT+01:00 Flavio Pompermaier <[hidden email]>:
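The projection-pushdown benefit mentioned above can be illustrated with a toy, self-contained sketch (plain Java; this is not Flink's or Parquet's actual API, and all names are made up for illustration): a column-oriented source that materializes only the requested columns reads far fewer cells than one that always reads every column and discards the extras later.

```java
/**
 * Toy illustration (not Flink's actual API) of why pushing a projection
 * into the data source helps: the source only materializes the requested
 * columns instead of every column of every row.
 */
public class ProjectionPushdown {

    // A "file" stored column-wise: COLUMNS[i] holds all values of column i.
    static final String[][] COLUMNS = {
        {"alice", "bob", "carol"},   // column 0: name
        {"30", "25", "41"},          // column 1: age
        {"rome", "oslo", "paris"}    // column 2: city
    };

    /** Without pushdown: read every column, then discard the unwanted ones. */
    static int readAll() {
        int cellsRead = 0;
        for (String[] column : COLUMNS) {
            cellsRead += column.length; // every cell is read from "disk"
        }
        return cellsRead;
    }

    /** With pushdown: the source reads only the projected columns. */
    static int readProjected(int... wanted) {
        int cellsRead = 0;
        for (int i : wanted) {
            cellsRead += COLUMNS[i].length;
        }
        return cellsRead;
    }

    public static void main(String[] args) {
        System.out.println("cells read without pushdown: " + readAll());
        System.out.println("cells read with pushdown:    " + readProjected(0, 2));
    }
}
```

With a columnar on-disk layout like Parquet's, skipping a column this way avoids the I/O entirely, which is exactly what pushing the projection into the source buys.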
Hi, just want to let you know that we opened a JIRA issue (FLINK-1236) to support local split assignment for the HadoopInputFormat. At least this performance issue should be easy to solve :-)

2014-11-11 12:44 GMT+01:00 Fabian Hueske <[hidden email]>:
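The idea behind local split assignment can be sketched in a few lines of plain Java (names and structure here are illustrative assumptions, not Flink's real classes): each input split records which hosts store its data, and the assigner prefers handing a worker a split whose data is local to that worker's host, falling back to a remote split only when nothing local remains.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

/**
 * Toy sketch of locality-aware split assignment (the idea behind FLINK-1236).
 * Not Flink's actual implementation; all names are hypothetical.
 */
public class LocalSplitAssigner {

    static class Split {
        final int id;
        final Set<String> hosts; // hosts that store this split's data
        Split(int id, String... hosts) {
            this.id = id;
            this.hosts = new HashSet<>(Arrays.asList(hosts));
        }
    }

    private final List<Split> unassigned;

    LocalSplitAssigner(List<Split> splits) {
        this.unassigned = new LinkedList<>(splits);
    }

    /** Prefer a split stored on the requesting host; else hand out any remaining split. */
    Split nextSplit(String requestingHost) {
        for (Iterator<Split> it = unassigned.iterator(); it.hasNext(); ) {
            Split s = it.next();
            if (s.hosts.contains(requestingHost)) {
                it.remove();
                return s; // local read: no network transfer needed
            }
        }
        return unassigned.isEmpty() ? null : unassigned.remove(0); // remote fallback
    }

    public static void main(String[] args) {
        LocalSplitAssigner assigner = new LocalSplitAssigner(Arrays.asList(
            new Split(0, "host-a"), new Split(1, "host-b"), new Split(2, "host-a")));
        System.out.println(assigner.nextSplit("host-b").id); // local split preferred
        System.out.println(assigner.nextSplit("host-b").id); // no local split left: remote
    }
}
```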
Yes, I've read it! Will it also support the HBase TableInputFormat (HTable and Scan are no longer serializable), or does the HBase addon basically become useless?

On Nov 12, 2014 9:10 PM, "Fabian Hueske" <[hidden email]> wrote:
I guess this depends on how the Flink TableInputFormat is implemented. In its current state, the TableInputFormat returns a key-value pair, just like the Hadoop HBase IF does. A Flink HBase IF could, for example, also unpack the HBase Result object into a tuple of column values, depending on the HBase query.

2014-11-12 21:17 GMT+01:00 Flavio Pompermaier <[hidden email]>:
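The unpacking idea can be sketched with a toy example (plain Java; FakeResult and Tuple3 below are hypothetical stand-ins, not the real HBase Result or Flink tuple classes): instead of emitting an opaque key-value pair, the input format flattens the columns named by the query into positional tuple fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy sketch of unpacking a scan result into a flat tuple of column values,
 * instead of returning an opaque key-value pair the way the Hadoop HBase
 * InputFormat does. Not actual Flink or HBase code.
 */
public class ResultUnpacking {

    /** Stand-in for an HBase Result: a row key plus a column -> value map. */
    static class FakeResult {
        final String rowKey;
        final Map<String, String> columns = new LinkedHashMap<>();
        FakeResult(String rowKey) { this.rowKey = rowKey; }
    }

    /** Stand-in for a Flink Tuple3<String, String, String>. */
    static class Tuple3 {
        final String f0, f1, f2;
        Tuple3(String f0, String f1, String f2) { this.f0 = f0; this.f1 = f1; this.f2 = f2; }
        @Override public String toString() { return "(" + f0 + "," + f1 + "," + f2 + ")"; }
    }

    /** Unpack the row key and the two columns named by the query into tuple fields. */
    static Tuple3 unpack(FakeResult result, String col1, String col2) {
        return new Tuple3(result.rowKey, result.columns.get(col1), result.columns.get(col2));
    }

    public static void main(String[] args) {
        FakeResult r = new FakeResult("row-1");
        r.columns.put("cf:name", "flink");
        r.columns.put("cf:year", "2014");
        System.out.println(unpack(r, "cf:name", "cf:year"));
    }
}
```

Downstream operators can then work on typed tuple fields directly, rather than digging values out of a result object in every user function.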