Re: HDFS append
Posted by rmetzger0
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/HDFS-append-tp530p547.html
Vasia is working on support for reading directories recursively, but I thought that this would also allow you to simulate something like an append.
Did you notice an issue when reading many small files with Flink? Flink handles file reading differently than Spark.
Spark basically starts a task for each file / file split. So if you have millions of small files in your HDFS, Spark will start millions of tasks (queued, however). You need to coalesce in Spark to reduce the number of partitions; by default, operators re-use the partitions of the preceding operator.
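To make the difference concrete, here is a toy sketch (plain Python, not Spark code; the function name `plan_spark_tasks` is made up) of how the one-partition-per-split model translates into task counts, and how coalesce caps them:

```python
# Toy model of Spark-style scheduling: one partition (and hence one task)
# per file/split, optionally capped by coalesce(). Hypothetical code,
# for illustration only -- not the actual Spark scheduler.

def plan_spark_tasks(files, coalesce_to=None):
    """Return the number of tasks queued for these input files."""
    partitions = len(files)          # one partition per file/split
    if coalesce_to is not None:
        # coalesce() merges partitions without a shuffle,
        # reducing the number of downstream tasks
        partitions = min(partitions, coalesce_to)
    return partitions

files = [f"part-{i:05d}" for i in range(1_000_000)]
print(plan_spark_tasks(files))                    # one task per small file
print(plan_spark_tasks(files, coalesce_to=200))   # capped after coalesce(200)
```

So without an explicit coalesce, a directory of a million small files turns into a million queued tasks.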
Flink, on the other hand, starts a fixed number of tasks, each of which reads multiple input splits; the splits are lazily assigned to the tasks as soon as they are ready to process new splits.
Flink will not create a partition for each (small) input file. I expect Flink to handle that case a bit better than Spark (I haven't tested it, though).
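A rough sketch of that lazy-assignment idea (again plain Python, not Flink code; `assign_splits` and the round-robin hand-out are simplifications of the real "whichever task is idle pulls the next split" behavior):

```python
# Toy model of Flink-style split assignment: a FIXED number of source
# tasks pull input splits one at a time from a shared pending list.
# Hypothetical illustration, not the actual Flink split assigner.

def assign_splits(splits, parallelism):
    """Hand out splits lazily to a fixed pool of tasks."""
    pending = list(splits)
    assignments = [[] for _ in range(parallelism)]
    task = 0
    while pending:
        # In reality, whichever task finishes its current split asks for
        # the next one; round-robin stands in for that here.
        assignments[task].append(pending.pop(0))
        task = (task + 1) % parallelism
    return assignments

splits = [f"file-{i}" for i in range(10_000)]
plan = assign_splits(splits, parallelism=8)
print(len(plan))   # still 8 tasks, no matter how many files there are
```

The point is that the task count is fixed by the parallelism, not by the number of files, so a million small files never means a million tasks.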