POJO Dataset read and write

Flavio Pompermaier
Hi to all,

I have a complex POJO (with nested objects) that I'd like to write and read with Flink (batch).
What is the simplest way to do that? I can't find any example of it :(

Best,
Flavio

Re: POJO Dataset read and write

Fabian Hueske-2
If you are just looking for an exchange format between two Flink jobs, I would go for the TypeSerializerInput/OutputFormat.
Note that these are binary formats.
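
To make "binary format" concrete, here is a minimal sketch in plain Java of writing a nested POJO field by field and reading it back. It only illustrates the idea of element-wise binary serialization; it is not Flink's API and not the actual byte layout of TypeSerializerInput/OutputFormat. The `Person` and `Address` classes and the `PojoBinaryRoundTrip` helper are hypothetical names, since the thread does not show the real POJO.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class PojoBinaryRoundTrip {

    // Hypothetical nested POJO; the thread does not show the real type.
    static class Address {
        String city;
        int zip;
        Address(String city, int zip) { this.city = city; this.zip = zip; }
    }

    static class Person {
        String name;
        int age;
        Address address; // nested object
        Person(String name, int age, Address address) {
            this.name = name; this.age = age; this.address = address;
        }
    }

    // Writes a record count, then each record field by field, one record after another.
    static byte[] write(List<Person> people) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(people.size());
            for (Person p : people) {
                out.writeUTF(p.name);
                out.writeInt(p.age);
                out.writeUTF(p.address.city);
                out.writeInt(p.address.zip);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in exactly the order they were written.
    static List<Person> read(byte[] data) {
        List<Person> people = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                int age = in.readInt();
                String city = in.readUTF();
                int zip = in.readInt();
                people.add(new Person(name, age, new Address(city, zip)));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return people;
    }

    public static void main(String[] args) {
        List<Person> people = new ArrayList<>();
        people.add(new Person("Ada", 36, new Address("London", 1815)));
        people.add(new Person("Alan", 41, new Address("Wilmslow", 1912)));
        List<Person> copy = read(write(people));
        System.out.println(copy.size() + " records, first city: " + copy.get(0).address.city);
    }
}
```

The output is only readable by code that knows the field order and types, which is why such a format suits an exchange between two jobs that share the same POJO definition rather than a general-purpose storage format.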

Best, Fabian

2015-11-27 15:28 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi to all,

I have a complex POJO (with nested objects) that I'd like to write and read with Flink (batch).
What is the simplest way to do that? I can't find any example of it :(

Best,
Flavio

Re: POJO Dataset read and write

Flavio Pompermaier
I ran a simple test comparing Parquet + Thrift with the TypeSerializer input/output formats: the former outperformed the latter for a simple filter (not pushed down) and a map+sum (roughly 2 s vs 33 s), and that is without considering disk space usage, which is also much worse. Is that normal, or is the TypeSerializer supposed to perform better than this?

On Fri, Nov 27, 2015 at 3:39 PM, Fabian Hueske <[hidden email]> wrote:
If you are just looking for an exchange format between two Flink jobs, I would go for the TypeSerializerInput/OutputFormat.
Note that these are binary formats.

Best, Fabian

Re: POJO Dataset read and write

Fabian Hueske-2
Parquet is much cleverer than the TypeSerializer and applies columnar storage and compression techniques.
The TypeSerializerIOFs just use Flink's element-wise serializers to write and read binary data.

I'd go with Parquet if it is working well for you.
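
As a rough illustration of the difference between element-wise and columnar storage, the toy sketch below lays out the same records record by record and column by column, then compresses both with DEFLATE and sums one field from each. It is plain Java under stated assumptions: the two-field record (`id`, `status`) and the `RowVsColumnLayout` class are hypothetical, and this is neither Parquet's file format nor Flink's serializer output. Parquet additionally applies its own encodings, which this sketch does not model.

```java
import java.nio.ByteBuffer;
import java.util.zip.Deflater;

// Toy comparison of two layouts for the same (id, status) records.
// NOT Parquet's or Flink's on-disk format; the record shape is hypothetical.
class RowVsColumnLayout {

    // Record by record: id0, status0, id1, status1, ...
    static byte[] rowWise(int[] ids, int[] status) {
        ByteBuffer buf = ByteBuffer.allocate(ids.length * 8);
        for (int i = 0; i < ids.length; i++) {
            buf.putInt(ids[i]).putInt(status[i]);
        }
        return buf.array();
    }

    // Column by column: all ids first, then all status values.
    static byte[] columnar(int[] ids, int[] status) {
        ByteBuffer buf = ByteBuffer.allocate(ids.length * 8);
        for (int id : ids) buf.putInt(id);
        for (int s : status) buf.putInt(s);
        return buf.array();
    }

    // Number of bytes after general-purpose DEFLATE compression.
    static int deflatedSize(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    // In the row layout the status values are interleaved with the ids,
    // so the reads are spread over the whole byte array.
    static long sumStatusRowWise(byte[] data, int n) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        long sum = 0;
        for (int i = 0; i < n; i++) sum += buf.getInt(i * 8 + 4);
        return sum;
    }

    // In the columnar layout the status values form one contiguous region
    // that starts right after the n ids.
    static long sumStatusColumnar(byte[] data, int n) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        long sum = 0;
        for (int i = 0; i < n; i++) sum += buf.getInt((n + i) * 4);
        return sum;
    }

    public static void main(String[] args) {
        int n = 10_000;
        int[] ids = new int[n];
        int[] status = new int[n];
        for (int i = 0; i < n; i++) {
            ids[i] = i;
            status[i] = i % 3; // low-cardinality field
        }
        byte[] rows = rowWise(ids, status);
        byte[] cols = columnar(ids, status);
        System.out.println("raw bytes:      " + rows.length + " vs " + cols.length);
        System.out.println("deflated bytes: " + deflatedSize(rows) + " vs " + deflatedSize(cols));
        System.out.println("sum(status):    " + sumStatusRowWise(rows, n) + " vs " + sumStatusColumnar(cols, n));
    }
}
```

Both layouts hold the same raw bytes and give the same sum; what changes is how similar values sit next to each other, which is what compression and single-column scans can take advantage of. The compressed sizes depend on the data, so run it on values resembling your own before drawing conclusions.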

2015-11-27 16:15 GMT+01:00 Flavio Pompermaier <[hidden email]>:
I ran a simple test comparing Parquet + Thrift with the TypeSerializer input/output formats: the former outperformed the latter for a simple filter (not pushed down) and a map+sum (roughly 2 s vs 33 s), and that is without considering disk space usage, which is also much worse. Is that normal, or is the TypeSerializer supposed to perform better than this?


Re: POJO Dataset read and write

Flavio Pompermaier
I was expecting Parquet + Thrift to be faster, but not by that much; I just wanted to know whether my results were right or not. Thanks for now, Fabian!

On Fri, Nov 27, 2015 at 4:22 PM, Fabian Hueske <[hidden email]> wrote:
Parquet is much cleverer than the TypeSerializer and applies columnar storage and compression techniques.
The TypeSerializerIOFs just use Flink's element-wise serializers to write and read binary data.

I'd go with Parquet if it is working well for you.
