Tuples serialization

Tuples serialization

Flavio Pompermaier
Hi to all,

in my use case I'd like to persist a batch of Tuple2<String, byte[]> records to a directory.
What is the most efficient way to achieve that in Flink?
I was thinking of using Avro, but I can't find an example of how to do that.
Once the files are written, how can I recreate a DataSet<Tuple2<String, byte[]>> from them?

Best,
Flavio

Re: Tuples serialization

Fabian Hueske-2
Have you tried the TypeSerializerOutputFormat?
This will serialize data using Flink's own serializers and write it to binary files.
The data can be read back using the TypeSerializerInputFormat.
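For example, a rough sketch (the output path is just a placeholder):

DataSet<Tuple2<String, byte[]>> batch = ...;
batch.write(new TypeSerializerOutputFormat<Tuple2<String, byte[]>>(), "hdfs:///path/to/dir");
env.execute();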

Cheers, Fabian


Re: Tuples serialization

Flavio Pompermaier
I've searched within Flink for a working example of TypeSerializerOutputFormat usage but didn't find anything usable.
Could you show me a simple snippet of code?
Do I have to configure BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY? Which size should I use? Will Flink write a single file or a set of files in a directory?
Is it possible to read all files in a directory at once?


Re: Tuples serialization

Flavio Pompermaier
I managed to read and write the files, but I still have two doubts:

Which size should I use for BLOCK_SIZE_PARAMETER_KEY?
Do I really have to create a sample tuple to extract the TypeInformation needed to instantiate the TypeSerializerInputFormat?


Re: Tuples serialization

Fabian Hueske-2
The BLOCK_SIZE_PARAMETER_KEY is used to split a file into processable blocks. Since this is a binary file format, the InputFormat cannot tell on its own where a new record starts. When writing such a file, each block starts with a new record and is filled until no further record fits in completely. The remaining space up to the next block boundary is padded.

As long as the BLOCK_SIZE_PARAMETER_KEY is larger than the maximum record size, and the Input- and OutputFormats use the same setting, the parameter has only performance implications. Smaller settings waste more space but allow for higher read parallelism (although too much parallelism causes scheduling overhead). I'd simply set it to 64 MB and experiment with smaller and larger settings if performance is a major concern here.
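One way to pass that setting to both sides is via withParameters() (a sketch; myTuples, path, and inFormat stand in for your data set, output location, and input format):

Configuration conf = new Configuration();
conf.setLong(BinaryOutputFormat.BLOCK_SIZE_PARAMETER_KEY, 64L * 1024 * 1024); // 64 MB
myTuples.write(new TypeSerializerOutputFormat<Tuple2<String, byte[]>>(), path)
        .withParameters(conf);

Configuration inConf = new Configuration();
inConf.setLong(BinaryInputFormat.BLOCK_SIZE_PARAMETER_KEY, 64L * 1024 * 1024); // must match the writer
env.createInput(inFormat).withParameters(inConf);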

You don't need to create a sample tuple to create a TypeInformation for it.

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
import org.apache.flink.api.java.typeutils.TupleTypeInfo;

// TypeInformation for Tuple2<String, byte[]>, built from the field types
private TupleTypeInfo<Tuple2<String, byte[]>> tInfo = new TupleTypeInfo<Tuple2<String, byte[]>>(
      BasicTypeInfo.STRING_TYPE_INFO,
      PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
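That TypeInformation can then be handed directly to the input format, for example (the path is a placeholder):

TypeSerializerInputFormat<Tuple2<String, byte[]>> inFormat =
      new TypeSerializerInputFormat<Tuple2<String, byte[]>>(tInfo);
inFormat.setFilePath("hdfs:///path/to/dir");
DataSet<Tuple2<String, byte[]>> tuples = env.createInput(inFormat);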


Re: Tuples serialization

Stephan Ewen
I think you don't need to create any TypeInformation at all; it is always present in the data set.

DataSet<Tuple2<String, Integer>> myTuples = ...;

// the output format picks up the tuple type from the data set automatically
myTuples.output(new TypeSerializerOutputFormat<Tuple2<String, Integer>>());


Re: Tuples serialization

Stephan Ewen
For the input side:

The data set myTuples exposes its type via myTuples.getType().
The TypeSerializerOutputFormat implements a special interface through which it picks up that type automatically.

If you want to use the type serializer input format, you can always do it like this:

DataSet<Tuple2<String, Integer>> myTuples = ...;
myTuples.write(new TypeSerializerOutputFormat<Tuple2<String, Integer>>(), filePath);
env.execute();

TypeInformation<Tuple2<String, Integer>> type = myTuples.getType();

TypeSerializerInputFormat<Tuple2<String, Integer>> inFormat =
      new TypeSerializerInputFormat<Tuple2<String, Integer>>(type);
inFormat.setFilePath(filePath); // a directory path also works: all files in it are read
DataSet<Tuple2<String, Integer>> input = env.createInput(inFormat);






