Potential block size issue with S3 binary files

Ken Krugler
Hi all,

Wondering if anyone else has run into this.

We write files to S3 using SerializedOutputFormat&lt;OurCustomPOJO&gt;. When we read them back with the matching SerializedInputFormat, we sometimes get deserialization errors where the data appears to be corrupt.
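
For reference, here’s a stripped-down sketch of the write/read path (not our real job; the paths are placeholders, and OurCustomPOJO implements IOReadableWritable and has a no-arg constructor):

    import org.apache.flink.api.common.io.SerializedInputFormat;
    import org.apache.flink.api.common.io.SerializedOutputFormat;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<OurCustomPOJO> data = env.fromElements(new OurCustomPOJO(), new OurCustomPOJO());

    // Write side: block size is taken from fs.getDefaultBlockSize() unless overridden.
    data.write(new SerializedOutputFormat<OurCustomPOJO>(), "s3://our-bucket/pojos/");
    env.execute("write pojos");

    // Read side (normally a separate job): same path, block size again taken
    // from the FileSystem's default.
    SerializedInputFormat<OurCustomPOJO> inFormat = new SerializedInputFormat<>();
    inFormat.setFilePath("s3://our-bucket/pojos/");
    DataSet<OurCustomPOJO> readBack =
        env.createInput(inFormat, TypeInformation.of(OurCustomPOJO.class));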

After a lot of logging, the weathervane of blame pointed towards the block size somehow not being the same between the write (where it’s 64MB) and the read (unknown).

When I added a call to SerializedInputFormat.setBlockSize() with the same 64MB value, the problems went away.
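
In code, the fix is just this (continuing the sketch above, with the 64MB spelled out):

    // Pin the read-side block size to the value used on the write, rather than
    // whatever default block size the S3 FileSystem reports.
    inFormat.setBlockSize(64L * 1024 * 1024);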

It looks like both the input and output formats use fs.getDefaultBlockSize() to set this value by default, so maybe the root issue is the S3 FileSystem reporting a different default block size at write time versus read time.

But it does feel a bit odd that we’re relying on this default at all, rather than having the block size recorded in the file when it’s written.

And it’s awkward to set the block size on the write side, since you have to set it in the environment conf, which means it applies to all output files in the job.
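
The only write-side knob I’ve found is the block-size parameter that BinaryOutputFormat.configure() reads from the Configuration it gets handed. Roughly (untested sketch; I’m assuming the BLOCK_SIZE_PARAMETER_KEY constant is public on the output format, as it is on the input format):

    import org.apache.flink.api.common.io.BinaryOutputFormat;
    import org.apache.flink.configuration.Configuration;

    Configuration conf = new Configuration();
    // Same 64MB value we now force on the read side. This Configuration has to be
    // the one that reaches the output format's configure() call, which in our setup
    // means the job-wide conf, hence it applying to every output in the job.
    conf.setLong(BinaryOutputFormat.BLOCK_SIZE_PARAMETER_KEY, 64L * 1024 * 1024);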

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra