HadoopDataOutputStream may not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream


HadoopDataOutputStream may not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream

LINZ, Arnaud

Hi,

 

I’ve noticed that when you use org.apache.flink.core.fs.FileSystem to write into an HDFS file, calling org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create() returns a HadoopDataOutputStream that wraps an org.apache.hadoop.fs.FSDataOutputStream (in its org.apache.hadoop.hdfs.client.HdfsDataOutputStream wrapper).

 

However, FSDataOutputStream exposes many methods such as flush(), getPos(), etc., while HadoopDataOutputStream only forwards write() and close().

 

For instance, flush() falls back to the default, empty implementation of OutputStream instead of the Hadoop one, which is confusing. Moreover, because of the restrictive OutputStream interface, hsync() and hflush() are not exposed to Flink; having a getWrappedStream() accessor might be convenient.
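For illustration, here is a minimal sketch of the kind of forwarding wrapper I have in mind. It is hypothetical, not the actual Flink class; the names (ForwardingHadoopOutputStream, getWrappedStream(), sync()) are assumptions, and it assumes Hadoop 2 so that hflush()/hsync() exist on the wrapped stream:

import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.fs.FSDataOutputStream;

// Hypothetical sketch, not Flink's actual HadoopDataOutputStream.
public class ForwardingHadoopOutputStream extends OutputStream {

    private final FSDataOutputStream hadoopStream;

    public ForwardingHadoopOutputStream(FSDataOutputStream hadoopStream) {
        this.hadoopStream = hadoopStream;
    }

    @Override
    public void write(int b) throws IOException {
        hadoopStream.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        hadoopStream.write(b, off, len);
    }

    @Override
    public void flush() throws IOException {
        // Forward instead of inheriting OutputStream's empty flush().
        hadoopStream.hflush();
    }

    public void sync() throws IOException {
        // Ask the DataNodes to persist the data to disk.
        hadoopStream.hsync();
    }

    public long getPos() throws IOException {
        return hadoopStream.getPos();
    }

    @Override
    public void close() throws IOException {
        hadoopStream.close();
    }

    // Escape hatch for anything that is not forwarded explicitly.
    public FSDataOutputStream getWrappedStream() {
        return hadoopStream;
    }
}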

 

(For now, this prevents me from using the Flink FileSystem object; I use Hadoop’s directly.)

 

Regards,

Arnaud

Re: HadoopDataOutputStream may not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream

Stephan Ewen
I think that is a very good idea.

Originally, we wrapped the Hadoop FS classes for convenience (they were changing, and we wanted to keep the system independent of Hadoop), but in my opinion these reasons are no longer relevant.

Let's start with your proposal and see if we can actually get rid of the wrapping in a way that is friendly to existing users.

Would you open an issue for this?

Greetings,
Stephan


RE: HadoopDataOutputStream may not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream

LINZ, Arnaud

Hi,

 

OK, I’ve created FLINK-2580 to track this issue (and FLINK-2579, which is totally unrelated).

 

I think I’m going to set up my dev environment to start contributing a little more than just complaining :).

 

Best regards,

Arnaud

 


Re: HadoopDataOutputStream may not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream

Ufuk Celebi

> On 27 Aug 2015, at 09:33, LINZ, Arnaud <[hidden email]> wrote:
>
> Hi,
>  
> OK, I’ve created FLINK-2580 to track this issue (and FLINK-2579, which is totally unrelated).

Thanks :)

> I think I’m going to set up my dev environment to start contributing a little more than just complaining :).

If you need any help with the setup, let us know. There is also this guide: https://ci.apache.org/projects/flink/flink-docs-master/internals/ide_setup.html

– Ufuk


Re: HadoopDataOutputStream may not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream

Stephan Ewen
Hi!

I pushed a fix to the master to expose more methods.

You can access the original Hadoop stream now, and you can also call "flush()" and "sync()" on the Flink stream, which are forwarded to "hflush()" and "hsync()" on Hadoop 2 (on Hadoop 1, these are not available).
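As a rough usage sketch of what this enables (the target path is made up, and the exact create() overload and the availability of sync() on the Flink FSDataOutputStream type in that Flink version are assumed from the description above):

import java.nio.charset.StandardCharsets;

import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class HdfsWriteExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical target file; create(path, true) overwrites it if it exists.
        Path path = new Path("hdfs:///tmp/flink-flush-example.txt");
        FileSystem fs = path.getFileSystem();

        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello".getBytes(StandardCharsets.UTF_8));

            // Forwarded to hflush()/hsync() on the wrapped Hadoop stream (Hadoop 2 only).
            out.flush();
            out.sync();
        }
    }
}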

The fix is in the master and I will make it part of the upcoming milestone release.

Greetings,
Stephan

