Variable Tuple Type

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Variable Tuple Type

Max Kießling
Hey,

for a project we need to represent data as lists. So each entry in the
DataSets basically holds a list of basic data type elements. When
processing the data we keep joining lists of the same shape and so the
list size grows over time

e.g. (a,b,c) x (c,d,e) -> (a,b,c,d,e)

Currently our solution basically is
to use a `DataSet<List<Long>>`.

The problem with this is, that the performance seems to be quite poor
compared to using tuples. When we compare the same job using either
Tuples or Lists, Tuples seem to be 2-10 times faster.

However since we can't know in advance how many elements the list will
have for a given job, using the Tuple0-25 would be both cumbersome and
complex, especially if the list size outgrows 25.
Using the Record class, the results look promising but using Tuples is
still double as fast.

In general our tests yield that the runtime is
Tuple < Array < Record < List

So my question is, do you see a possible way to create a variable length
tuple type which can grow almost indefinitely while keeping most of the
benefits of the TupleXX classes but skipping lots of the overhead of
Record (like keeping track of possible null values etc)

Thanks a lot
Best Max
Reply | Threaded
Open this post in threaded view
|

Re: Variable Tuple Type

Fabian Hueske-2
Hi Max,

Tuples in Flink are of fixed length. You can define your own data types and serializers, but this is not the easiest solution.

I would go for Array types, especially if your data can be primitive types (long).
The serializer for primitive arrays should be almost as efficient as the Tuple serializers. The only overhead is serializing the length of the array which are 4 bytes.

Best, Fabian


2016-12-05 17:28 GMT+01:00 Max Kießling <[hidden email]>:
Hey,

for a project we need to represent data as lists. So each entry in the
DataSets basically holds a list of basic data type elements. When
processing the data we keep joining lists of the same shape and so the
list size grows over time

e.g. (a,b,c) x (c,d,e) -> (a,b,c,d,e)

Currently our solution basically is
to use a `DataSet<List<Long>>`.

The problem with this is, that the performance seems to be quite poor
compared to using tuples. When we compare the same job using either
Tuples or Lists, Tuples seem to be 2-10 times faster.

However since we can't know in advance how many elements the list will
have for a given job, using the Tuple0-25 would be both cumbersome and
complex, especially if the list size outgrows 25.
Using the Record class, the results look promising but using Tuples is
still double as fast.

In general our tests yield that the runtime is
Tuple < Array < Record < List

So my question is, do you see a possible way to create a variable length
tuple type which can grow almost indefinitely while keeping most of the
benefits of the TupleXX classes but skipping lots of the overhead of
Record (like keeping track of possible null values etc)

Thanks a lot
Best Max

Reply | Threaded
Open this post in threaded view
|

Re: Variable Tuple Type

Greg Hogan
In reply to this post by Max Kießling
Hi Max,

Belated response but this looks to be the same problem I am working to solve in Gelly with graph data in FLINK-3695 [0]. These arrays allow for object reuse. Interface is here [1]. Additional Value types are easy to add but Long, Int, and String are most common to Gelly.

Suggestions are welcome, and please let us know how you've solved your problem.

Greg

On Mon, Dec 5, 2016 at 11:28 AM, Max Kießling <[hidden email]> wrote:
Hey,

for a project we need to represent data as lists. So each entry in the
DataSets basically holds a list of basic data type elements. When
processing the data we keep joining lists of the same shape and so the
list size grows over time

e.g. (a,b,c) x (c,d,e) -> (a,b,c,d,e)

Currently our solution basically is
to use a `DataSet<List<Long>>`.

The problem with this is, that the performance seems to be quite poor
compared to using tuples. When we compare the same job using either
Tuples or Lists, Tuples seem to be 2-10 times faster.

However since we can't know in advance how many elements the list will
have for a given job, using the Tuple0-25 would be both cumbersome and
complex, especially if the list size outgrows 25.
Using the Record class, the results look promising but using Tuples is
still double as fast.

In general our tests yield that the runtime is
Tuple < Array < Record < List

So my question is, do you see a possible way to create a variable length
tuple type which can grow almost indefinitely while keeping most of the
benefits of the TupleXX classes but skipping lots of the overhead of
Record (like keeping track of possible null values etc)

Thanks a lot
Best Max