Re: HDFS append

Posted by Flavio Pompermaier on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/HDFS-append-tp530p546.html

Great! Appending data to HDFS will be a very useful feature!
I think you should then also consider how to efficiently read directories that contain a lot of small files. I know this can be quite inefficient, which is why Spark gives you a coalesce operation to deal with such cases.
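
To illustrate what I mean by coalescing, here is a minimal sketch in plain Python. It is not Spark's or Flink's actual implementation. The function name coalesce_files and the file names and sizes are made up for the example. It assigns many small files to a fixed number of groups, so that each reader processes one group instead of one tiny file.

```python
import heapq

def coalesce_files(file_sizes, num_groups):
    """Assign files to num_groups groups, largest file first,
    always placing the next file into the currently smallest group."""
    # Heap entries are (total size of the group so far, group index).
    heap = [(0, i) for i in range(num_groups)]
    groups = [[] for _ in range(num_groups)]
    # Sort by size descending, then by name so the result is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return groups

# Hypothetical small files with their sizes (arbitrary units).
sizes = {"part-0": 5, "part-1": 3, "part-2": 2, "part-3": 2, "part-4": 1, "part-5": 1}
groups = coalesce_files(sizes, 2)
```

The greedy assignment is only one possible strategy. A real system would probably also take block locality into account.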

On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[hidden email]> wrote:
Hi!

Yes, I took a look into this. I hope I'll be able to find some time to work on it this week.
I'll keep you updated :)

Cheers,
V.

On 9 December 2014 at 14:03, Robert Metzger <[hidden email]> wrote:
It seems that Vasia started working on adding support for recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
I'm still occupied with refactoring the YARN client; the HDFS refactoring is next on my list.

On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[hidden email]> wrote:
Any news about this Robert?

Thanks in advance,
Flavio

On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[hidden email]> wrote:
Hi,

I think there is no support for appending to HDFS files in Flink yet.
HDFS supports it, but some adjustments are required in the system (not deleting / creating directories before writing; exposing the append() methods in the FS abstractions).
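
As a rough sketch of the behaviour change, using a local file as a stand-in for HDFS (the helper names write_output and read_lines are hypothetical, and this does not use the Hadoop API, where the corresponding call would be FileSystem.append(Path)): the writer has to open the existing output in append mode instead of replacing it.

```python
import os
import tempfile

def write_output(path, lines, append):
    """Write lines to path. With append=False the existing content is
    replaced (overwrite behaviour); with append=True it is kept."""
    mode = "a" if append else "w"
    with open(path, mode) as f:
        for line in lines:
            f.write(line + "\n")

def read_lines(path):
    """Return all lines of the file without trailing newlines."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f]

out = os.path.join(tempfile.mkdtemp(), "result.txt")
write_output(out, ["first"], append=False)
write_output(out, ["second"], append=True)   # keeps "first"
appended = read_lines(out)
write_output(out, ["third"], append=False)   # replaces everything written so far
overwritten = read_lines(out)
```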

I'm planning to work on the FS abstractions next week. If I have enough time, I can also look into adding support for append().

Another approach could be adding support for recursively reading directories with the input formats. Vasia asked for this feature a few days ago on the mailing list. If we had that feature, you could just write each append to its own directory and read the parent directory (with all the dirs for the appends).
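
A minimal sketch of that idea, again on the local filesystem rather than HDFS and without any Flink API (write_batch, read_recursively and the directory names are made up for illustration): every "append" becomes a new subdirectory, and the reader walks the parent directory recursively.

```python
import os
import tempfile

def write_batch(base_dir, batch_name, lines):
    """Write one 'append' as a new subdirectory instead of modifying existing files."""
    batch_dir = os.path.join(base_dir, batch_name)
    os.makedirs(batch_dir)  # raises if the batch already exists, so nothing is overwritten
    with open(os.path.join(batch_dir, "part-0"), "w") as f:
        for line in lines:
            f.write(line + "\n")

def read_recursively(base_dir):
    """Read every line of every file below base_dir, in sorted path order."""
    records = []
    for root, dirs, files in os.walk(base_dir):
        dirs.sort()  # sort in place so subdirectories are visited in a fixed order
        for name in sorted(files):
            with open(os.path.join(root, name)) as f:
                records.extend(line.rstrip("\n") for line in f)
    return records

base = tempfile.mkdtemp()
write_batch(base, "append-0001", ["a", "b"])
write_batch(base, "append-0002", ["c"])
all_records = read_recursively(base)
```

Note that this produces one or more small files per append, so reading gets slower as the number of appends grows.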

Best,
Robert

On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[hidden email]> wrote:
Hi guys,
how can I efficiently append data (as plain strings or Avro records) to HDFS using Flink?
Do I need to use Flume or can I avoid it?

Thanks in advance,
Flavio