Hi guys,
how can I efficiently append data (as plain strings or Avro records) to HDFS using Flink? Do I need to use Flume, or can I avoid it? Thanks in advance, Flavio
Hi, I think there is no support for appending to HDFS files in Flink yet. HDFS supports it, but some adjustments in the system are required (not deleting/creating directories before writing; exposing the append() methods in the FS abstractions). I'm planning to work on the FS abstractions next week; if I have enough time, I can also look into adding support for append(). Another approach would be adding support for recursively reading directories with the input formats. Vasia asked for this feature a few days ago on the mailing list. If we had that feature, you could just write to a directory and read the parent directory (with all the sub-directories for the appends). Best, Robert On Thu, Dec 4, 2014 at 5:59 PM, Flavio Pompermaier <[hidden email]> wrote:
Any news about this, Robert?
Thanks in advance, Flavio On Thu, Dec 4, 2014 at 10:03 PM, Robert Metzger <[hidden email]> wrote:
It seems that Vasia started working on adding support for recursive reading: https://issues.apache.org/jira/browse/FLINK-1307. I'm still occupied with refactoring the YARN client; the HDFS refactoring is next on my list. On Tue, Dec 9, 2014 at 11:59 AM, Flavio Pompermaier <[hidden email]> wrote:
Hi! Yes, I took a look into this. I hope I'll be able to find some time to work on it this week. I'll keep you updated :) Cheers, V. On 9 December 2014 at 14:03, Robert Metzger <[hidden email]> wrote:
Great! Appending data to HDFS will be a very useful feature!
I think you should then also consider how to efficiently read directories containing a lot of small files. I know that this can be quite inefficient, which is why Spark provides a coalesce operation to deal with such cases. On Tue, Dec 9, 2014 at 2:39 PM, Vasiliki Kalavri <[hidden email]> wrote:
Vasia is working on support for reading directories recursively, but I thought that this also allows you to simulate something like an append. Spark basically starts a task for each file / file split, so if you have millions of small files in your HDFS, Spark will start millions of tasks (queued, though). You need coalesce in Spark to reduce the number of partitions; by default, operators re-use the partitioning of the preceding operator. Flink, on the other hand, starts a fixed number of tasks which read multiple input splits; the splits are lazily assigned to these tasks once they are ready to process new splits. Flink will not create a partition for each (small) input file. I expect Flink to handle that case a bit better than Spark (I haven't tested it, though). On Tue, Dec 9, 2014 at 3:03 PM, Flavio Pompermaier <[hidden email]> wrote:
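For illustration, a minimal sketch of that behavior with the Java batch API (the directory paths and the parallelism value are made up, and the parallelism setter may be called setDegreeOfParallelism in older releases):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ManySmallFilesJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Only 8 source tasks are started, no matter how many files the
        // directory contains; input splits are handed to those tasks lazily
        // as each one finishes its previous split.
        env.setParallelism(8);

        // Hypothetical directory holding thousands of small files.
        DataSet<String> lines = env.readTextFile("hdfs:///data/many-small-files/");

        lines.writeAsText("hdfs:///data/many-small-files-copy/");
        env.execute("Read a directory of many small files");
    }
}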
I didn't know about that difference! Flink is very smart then :)
Thanks for the explanation, Robert. On Tue, Dec 9, 2014 at 3:33 PM, Robert Metzger <[hidden email]> wrote:
Hey Flavio, this pull request got merged: https://github.com/apache/incubator-flink/pull/260
With this, you can now simulate append behavior with Flink:
- You have a directory in HDFS where you put the files you want to append, e.g. hdfs:///data/appendjob/
- Each time you want to append something, you run your job and let it create a new sub-directory in hdfs:///data/appendjob/, let's say hdfs:///data/appendjob/run-X/
- Now you can instruct the job to read the full output by letting it recursively read hdfs:///data/appendjob/ (see the sketch below).
I hope that helps. Best, Robert On Tue, Dec 9, 2014 at 3:37 PM, Flavio Pompermaier <[hidden email]> wrote:
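A minimal sketch of that workflow with the Java batch API (the paths are examples, and "recursive.file.enumeration" is, as far as I know, the configuration flag that the recursive-enumeration feature introduced; please check the DataSource documentation of your Flink version):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class ReadAppendedRunsJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Every "append" run writes its output into a fresh sub-directory,
        // e.g. hdfs:///data/appendjob/run-1/, run-2/, ...
        // Reading the parent directory recursively picks up all of them.
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        DataSet<String> allRuns = env
                .readTextFile("hdfs:///data/appendjob/")
                .withParameters(parameters);

        // Example sink: write the combined data to another location.
        allRuns.writeAsText("hdfs:///data/appendjob-merged/");
        env.execute("Read all appended runs");
    }
}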
Thanks a lot, Robert! On Dec 15, 2014 12:54 PM, "Robert Metzger" <[hidden email]> wrote: