Parquet example

Parquet example

Flavio Pompermaier
Hi to all,

I'd like to know whether Flink is able to exploit the Parquet format to read data efficiently from HDFS.
Is there any example available?

Best,
Flavio

Re: Parquet example

Fabian Hueske
Hi Flavio,

I am not aware of a Flink InputFormat for Parquet. However, it should hopefully be covered by the Hadoop IF wrapper.
A dedicated Flink IF would be great though, IMO.
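
For example, something along these lines should work through the wrapper (untested sketch; it assumes flink-hadoop-compatibility and parquet-avro 1.6+ on the classpath, and the paths as well as the exact package names are illustrative and may differ between versions):

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
// Moved to org.apache.flink.api.java.hadoop.mapreduce in later Flink versions.
import org.apache.flink.hadoopcompatibility.mapreduce.HadoopInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
// "parquet.avro" is "org.apache.parquet.avro" in newer parquet-mr releases.
import parquet.avro.AvroParquetInputFormat;

public class ParquetReadExample {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        FileInputFormat.addInputPath(job, new Path("hdfs:///data/parquet/input")); // hypothetical path

        // Wrap Parquet's Hadoop InputFormat; records arrive as (Void, GenericRecord) pairs.
        HadoopInputFormat<Void, GenericRecord> parquetInput =
                new HadoopInputFormat<Void, GenericRecord>(
                        new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);

        DataSet<Tuple2<Void, GenericRecord>> records = env.createInput(parquetInput);

        // Drop the key (always null for Parquet) and work with the Avro records.
        DataSet<String> asText = records.map(new MapFunction<Tuple2<Void, GenericRecord>, String>() {
            @Override
            public String map(Tuple2<Void, GenericRecord> record) {
                return record.f1.toString();
            }
        });

        asText.writeAsText("hdfs:///data/parquet/as-text"); // hypothetical output path
        env.execute("Parquet via Hadoop IF wrapper");
    }
}

The key of the pair is always null for Parquet, so in practice you only work with the GenericRecord in the value field.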

Best, Fabian

Re: Parquet example

Flavio Pompermaier
Maybe this is a dumb question, but could you explain the benefits of a dedicated Flink IF over the one available by default through the Hadoop IF wrapper?
Is it just because of data locality for the task slots?

Re: Parquet example

Fabian Hueske
First of all, split locality can make a huge difference.
A dedicated IF would also enable a tighter integration, both API-wise and for the execution, for example by pushing filters or projections directly into the data source and thereby reducing the amount of data read from the file system.
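
To make the second point concrete: with the wrapped Hadoop IF such a projection has to be configured by hand on the Hadoop Job, e.g. via parquet-avro (untested sketch, the field names are made up), whereas a dedicated Flink IF could derive it from the program:

import org.apache.avro.Schema;
import org.apache.hadoop.mapreduce.Job;
// "parquet.avro" is "org.apache.parquet.avro" in newer parquet-mr releases.
import parquet.avro.AvroParquetInputFormat;

public class ProjectionPushdown {

    /** Restricts the Parquet scan to the "id" and "name" columns (hypothetical field names). */
    public static void configure(Job job) {
        Schema projection = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Projection\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}");
        // Parquet only materializes the requested columns; all others are skipped on disk.
        AvroParquetInputFormat.setRequestedProjection(job, projection);
    }
}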

Re: Parquet example

Fabian Hueske
Hi,

Just wanted to let you know that we opened a JIRA issue (FLINK-1236) to support local split assignment for the HadoopInputFormat.
At least this performance issue should be easy to solve :-)
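
The locality information itself is already available on the Hadoop side; it just has to be used during split assignment. Roughly (untested sketch, hypothetical path):

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("hdfs:///data/input")); // hypothetical path

        // Each Hadoop split reports the hosts that store its blocks;
        // a locality-aware assigner can match these against the TaskManager hosts.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            FileSplit fileSplit = (FileSplit) split;
            System.out.println(fileSplit.getPath() + " -> " + Arrays.toString(fileSplit.getLocations()));
        }
    }
}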

Re: Parquet example

Flavio Pompermaier

Yes, I've read it! Will it also support the HBase TableInputFormat (HTable and Scan are no longer serializable), or does the HBase addon basically become useless?

Re: Parquet example

Fabian Hueske
I guess this depends on how the Flink TableInputFormat is implemented.

In its current state, the TableInputFormat returns a key-value pair, just like the Hadoop HBase IF does. A Flink HBase IF could, for example, also unpack the HBase Result object into a tuple of column values, depending on the HBase query.
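
As a rough sketch of that second option, the unpacking can also be done today with a map over the wrapped format's output (untested; the column family and qualifiers are made up, and the age column is assumed to store an 8-byte long):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

public class ResultToTuple
        implements MapFunction<Tuple2<ImmutableBytesWritable, Result>, Tuple3<String, String, Long>> {

    @Override
    public Tuple3<String, String, Long> map(Tuple2<ImmutableBytesWritable, Result> kv) {
        Result row = kv.f1;
        String rowKey = Bytes.toString(kv.f0.get());
        // Hypothetical column family "cf" with qualifiers "name" and "age".
        String name = Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")));
        long age = Bytes.toLong(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("age")));
        return new Tuple3<String, String, Long>(rowKey, name, age);
    }
}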
