Reading Parquet/Hive


Gwenhael Pasquiers

Hi,

I’m trying to read Parquet/Hive data using Parquet’s ParquetInputFormat and Hive’s DataWritableReadSupport.

I get an error when the TupleSerializer tries to create an instance of ArrayWritable using reflection, because ArrayWritable has no no-args constructor.
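A minimal sketch of that failure mode, using only the JDK. The classes OnlyArgConstructor and WithNoArgConstructor and the helper canInstantiate are hypothetical stand-ins introduced for illustration; they are not part of Hadoop, Hive, or the serializer mentioned above. The sketch only assumes that the instance is created reflectively through a no-args constructor, as described in the post.

```java
import java.lang.reflect.Constructor;

public class NoArgConstructorDemo {

    // Hypothetical stand-in for a class like ArrayWritable: its only
    // constructor takes an argument, so no no-args constructor exists.
    static class OnlyArgConstructor {
        final String[] values;
        OnlyArgConstructor(String[] values) { this.values = values; }
    }

    // Hypothetical stand-in with an explicit no-args constructor, for comparison.
    static class WithNoArgConstructor {
        WithNoArgConstructor() { }
    }

    // Returns true if the class can be instantiated reflectively without arguments.
    static boolean canInstantiate(Class<?> type) {
        try {
            Constructor<?> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            ctor.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // NoSuchMethodException ends up here when there is no no-args constructor.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(OnlyArgConstructor.class));
        System.out.println(canInstantiate(WithNoArgConstructor.class));
    }
}
```

The first call fails because the lookup of a parameterless constructor throws before any instance is created, which is why adding such a constructor to a copied class makes the error go away.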

 

I’ve been able to make it work when executing in a local cluster by copying the ArrayWritable class into my own sources and adding the constructor. I guess that the classpath built by Maven puts my code first and allows me to override the original class. However, when running on the real cluster (YARN on Cloudera) the exception comes back (I guess that the original class comes first in the classpath there).

Do you have an idea of how I could make it work?

I think I’m tied to the ArrayWritable type because DataWritableReadSupport extends ReadSupport<ArrayWritable>.

Would it be possible (and not too complicated) to make a DataSource that does not generate Tuples and lets me convert the ArrayWritable to a friendlier type such as String[]? If you have any other ideas, they are welcome!

Best regards,

Gwenhaël PASQUIERS

RE: Reading Parquet/Hive

Gwenhael Pasquiers

I’ll answer my own question :)

I think I’ve managed to make it work by creating my own “WrappingReadSupport”, which wraps the DataWritableReadSupport and inserts my “WrappingMaterializer”, which converts the ArrayWritable produced by the original materializer to String[]. The String[] then poses no issue for Tuple, and it seems to work.
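The wrapping idea can be sketched as follows, under loud assumptions. The post does not include the actual code, so this is not the author's implementation. To stay runnable with the JDK alone, Materializer, ReadSupportLike, and Row are simplified, hypothetical stand-ins for the materializer type, the read support type, and ArrayWritable; the real Parquet and Hive classes have richer interfaces that a real wrapper would also need to delegate.

```java
import java.util.Arrays;

public class WrappingSketch {

    // Hypothetical, simplified stand-in for a record materializer:
    // it only exposes the record that was just assembled.
    interface Materializer<T> {
        T getCurrentRecord();
    }

    // Hypothetical stand-in for a read support that hands out a materializer.
    interface ReadSupportLike<T> {
        Materializer<T> prepareForRead();
    }

    // Hypothetical stand-in for a record type without a no-args constructor.
    static final class Row {
        private final Object[] fields;
        Row(Object[] fields) { this.fields = fields; }
        Object[] get() { return fields; }
    }

    // Delegates to the original materializer and converts each record to String[].
    static final class WrappingMaterializer implements Materializer<String[]> {
        private final Materializer<Row> delegate;
        WrappingMaterializer(Materializer<Row> delegate) { this.delegate = delegate; }

        @Override
        public String[] getCurrentRecord() {
            Object[] fields = delegate.getCurrentRecord().get();
            String[] out = new String[fields.length];
            for (int i = 0; i < fields.length; i++) {
                // Null fields stay null; everything else is rendered with toString().
                out[i] = fields[i] == null ? null : fields[i].toString();
            }
            return out;
        }
    }

    // Wraps the original read support so that callers only ever see String[] records.
    static final class WrappingReadSupport implements ReadSupportLike<String[]> {
        private final ReadSupportLike<Row> delegate;
        WrappingReadSupport(ReadSupportLike<Row> delegate) { this.delegate = delegate; }

        @Override
        public Materializer<String[]> prepareForRead() {
            return new WrappingMaterializer(delegate.prepareForRead());
        }
    }

    public static void main(String[] args) {
        // A fake "original" read support that always yields the same row.
        ReadSupportLike<Row> original = () -> () -> new Row(new Object[] {1, "a", null});
        String[] record = new WrappingReadSupport(original).prepareForRead().getCurrentRecord();
        System.out.println(Arrays.toString(record));
    }
}
```

The point of the design is that the problematic record type never leaves the wrapper, so downstream code only has to handle String[].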

 

Now, let’s write those String[] to Parquet too :)

From: Gwenhael Pasquiers [mailto:[hidden email]]
Sent: Friday, 18 December 2015 10:04
To: [hidden email]
Subject: Reading Parquet/Hive

 
