Thrift object serialization

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Thrift object serialization

Flavio Pompermaier
Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class, MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
DataSet<Tuple2<Void, MyThriftObj>> ds = env.createInput(inputFormat);

Flink logs this message:
  • TypeExtractor - class MyThriftObj contains custom serialization methods we do not call.

Indeed MyThriftObj has readObject/writeObject functions and when I print the type of ds I see:
  • Java Tuple2<Void, GenericType<MyThriftObj>>
Fom my experience GenericType is a performace killer...what should I do to improve the reading/writing of MyThriftObj?

Best,
Flavio


--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908
Reply | Threaded
Open this post in threaded view
|

Re: Thrift object serialization

Tzu-Li (Gordon) Tai
Hi Flavio!

I believe [1] has what you are looking for. Have you taken a look at that?

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/custom_serializers.html

On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier ([hidden email]) wrote:

Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class, MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
DataSet<Tuple2<Void, MyThriftObj>> ds = env.createInput(inputFormat);

Flink logs this message:
  • TypeExtractor - class MyThriftObj contains custom serialization methods we do not call.

Indeed MyThriftObj has readObject/writeObject functions and when I print the type of ds I see:
  • Java Tuple2<Void, GenericType<MyThriftObj>>
Fom my experience GenericType is a performace killer...what should I do to improve the reading/writing of MyThriftObj?

Best,
Flavio


--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908
Reply | Threaded
Open this post in threaded view
|

Re: Thrift object serialization

Flavio Pompermaier
Hi Gordon,
thanks for the link. Will the usage ofTBaseSerializer wrt Kryo lead to a performance gain?

On Tue, May 16, 2017 at 7:32 AM, Tzu-Li (Gordon) Tai <[hidden email]> wrote:
Hi Flavio!

I believe [1] has what you are looking for. Have you taken a look at that?

Cheers,
Gordon


On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier ([hidden email]) wrote:

Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class, MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
DataSet<Tuple2<Void, MyThriftObj>> ds = env.createInput(inputFormat);

Flink logs this message:
  • TypeExtractor - class MyThriftObj contains custom serialization methods we do not call.

Indeed MyThriftObj has readObject/writeObject functions and when I print the type of ds I see:
  • Java Tuple2<Void, GenericType<MyThriftObj>>
Fom my experience GenericType is a performace killer...what should I do to improve the reading/writing of MyThriftObj?

Best,
Flavio


--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. <a href="tel:+39%200461%20182%203908" value="+3904611823908" target="_blank">+(39) 0461 1823908



--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908
Reply | Threaded
Open this post in threaded view
|

Re: Thrift object serialization

Tzu-Li (Gordon) Tai
If you don’t register the TBaseSerializer for your MyThriftObj (or in general don’t register any serializer for the Thrift class), I think Kryo’s default FieldSerializer will be used for it.

The TBaseSerializer basically just uses TBase for de-/serialization as you normally would for the Thrift classes (see [1]), so there should be some specific optimization going on for that compared to Kryo’s generic FieldSerializer. I’m not entirely sure about performance gain between the two as I don’t really know the details of the serialization differences, but I would suggest to use TBaseSerializer if they are Thrift classes.

Cheers,
Gordon

[1] https://github.com/twitter/chill/blob/develop/chill-thrift/src/main/java/com/twitter/chill/thrift/TBaseSerializer.java

On 16 May 2017 at 3:26:32 PM, Flavio Pompermaier ([hidden email]) wrote:

Hi Gordon,
thanks for the link. Will the usage ofTBaseSerializer wrt Kryo lead to a performance gain?

On Tue, May 16, 2017 at 7:32 AM, Tzu-Li (Gordon) Tai <[hidden email]> wrote:
Hi Flavio!

I believe [1] has what you are looking for. Have you taken a look at that?

Cheers,
Gordon


On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier ([hidden email]) wrote:

Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class, MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
DataSet<Tuple2<Void, MyThriftObj>> ds = env.createInput(inputFormat);

Flink logs this message:
  • TypeExtractor - class MyThriftObj contains custom serialization methods we do not call.

Indeed MyThriftObj has readObject/writeObject functions and when I print the type of ds I see:
  • Java Tuple2<Void, GenericType<MyThriftObj>>
Fom my experience GenericType is a performace killer...what should I do to improve the reading/writing of MyThriftObj?

Best,
Flavio


--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. <a href="tel:+39%200461%20182%203908" value="+3904611823908" target="_blank">+(39) 0461 1823908



--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 1823908
Reply | Threaded
Open this post in threaded view
|

Re: Thrift object serialization

Flavio Pompermaier
Ok thanks Gordon! It would be nice to have a benchmark also on this ;)

Thanks a lot for the support,
Flavio

On Tue, May 16, 2017 at 9:41 AM, Tzu-Li (Gordon) Tai <[hidden email]> wrote:
If you don’t register the TBaseSerializer for your MyThriftObj (or in general don’t register any serializer for the Thrift class), I think Kryo’s default FieldSerializer will be used for it.

The TBaseSerializer basically just uses TBase for de-/serialization as you normally would for the Thrift classes (see [1]), so there should be some specific optimization going on for that compared to Kryo’s generic FieldSerializer. I’m not entirely sure about performance gain between the two as I don’t really know the details of the serialization differences, but I would suggest to use TBaseSerializer if they are Thrift classes.

Cheers,
Gordon

[1] https://github.com/twitter/chill/blob/develop/chill-thrift/src/main/java/com/twitter/chill/thrift/TBaseSerializer.java


On 16 May 2017 at 3:26:32 PM, Flavio Pompermaier ([hidden email]) wrote:

Hi Gordon,
thanks for the link. Will the usage ofTBaseSerializer wrt Kryo lead to a performance gain?

On Tue, May 16, 2017 at 7:32 AM, Tzu-Li (Gordon) Tai <[hidden email]> wrote:
Hi Flavio!

I believe [1] has what you are looking for. Have you taken a look at that?

Cheers,
Gordon


On 15 May 2017 at 9:08:33 PM, Flavio Pompermaier ([hidden email]) wrote:

Hi to all,
in my Flink job I create a Dataset<MyThriftObj> using HadoopInputFormat in this way:

HadoopInputFormat<Void, MyThriftObj> inputFormat = new HadoopInputFormat<>(
        new ParquetThriftInputFormat<MyThriftObj>(), Void.class, MyThriftObj.class, job);
FileInputFormat.addInputPath(job,  new org.apache.hadoop.fs.Path(inputPath);
DataSet<Tuple2<Void, MyThriftObj>> ds = env.createInput(inputFormat);

Flink logs this message:
  • TypeExtractor - class MyThriftObj contains custom serialization methods we do not call.

Indeed MyThriftObj has readObject/writeObject functions and when I print the type of ds I see:
  • Java Tuple2<Void, GenericType<MyThriftObj>>
Fom my experience GenericType is a performace killer...what should I do to improve the reading/writing of MyThriftObj?

Best,
Flavio


--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. <a href="tel:+39%200461%20182%203908" value="+3904611823908" target="_blank">+(39) 0461 1823908



--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. <a href="tel:+39%200461%20182%203908" value="+3904611823908" target="_blank">+(39) 0461 1823908