HDFS append

HDFS append

Flavio Pompermaier
Hi guys,
how can I efficiently append data (as plain strings or as Avro records) to HDFS using Flink?
Do I need to use Flume, or can I avoid it?

Thanks in advance,
Flavio

Re: HDFS append

rmetzger0
Hi,

I think there is no support for appending to HDFS files in Flink yet.
HDFS itself supports it, but some adjustments in the system are required (not deleting/creating directories before writing; exposing the append() method in the FS abstractions).

I'm planning to work on the FS abstractions next week; if I have enough time, I can also look into adding support for append().

Another approach could be adding support for recursively reading directories with the input formats. Vasia asked for this feature a few days ago on the mailing list. If we had that feature, you could just write each run into a new subdirectory and read the parent directory (with all the directories for the appends).
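
For reference, this is roughly what appending looks like at the plain Hadoop API level (a minimal sketch, not Flink code; the file path is made up, and append support has to be enabled on the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the cluster must allow appends for this call to succeed
        conf.setBoolean("dfs.support.append", true);
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("hdfs:///data/records.txt"); // hypothetical path
        // open the existing file at its current end and add one more record
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("another record\n");
        }
    }
}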

Best,
Robert

Re: HDFS append

Flavio Pompermaier
Any news about this, Robert?

Thanks in advance,
Flavio

Re: HDFS append

rmetzger0
It seems that Vasia started working on adding support for recursive reading: https://issues.apache.org/jira/browse/FLINK-1307.
I'm still occupied with refactoring the YARN client; the HDFS refactoring is next on my list.

Re: HDFS append

Vasiliki Kalavri
Hi!

Yes, I took a look at this. I hope I'll be able to find some time to work on it this week.
I'll keep you updated :)

Cheers,
V.

Re: HDFS append

Flavio Pompermaier
Great! Appending data to HDFS will be a very useful feature!
I think you should then also consider how to read directories containing a lot of small files efficiently. I know that this can be quite inefficient, which is why Spark provides a coalesce operation to deal with such cases.

Re: HDFS append

rmetzger0
Vasia is working on support for reading directories recursively, but I think that also allows you to simulate something like an append.

Did you notice an issue when reading many small files with Flink? Flink handles file reading differently than Spark.

Spark basically starts a task for each file / file split, so if you have millions of small files in HDFS, Spark will start millions of tasks (queued, however). You need coalesce() in Spark to reduce the number of partitions; by default, it reuses the partitioning of the preceding operator.
Flink, on the other hand, starts a fixed number of tasks that read multiple input splits; splits are lazily assigned to these tasks once they are ready to process new ones.
Flink will not create a partition for each (small) input file, so I expect Flink to handle that case a bit better than Spark (I haven't tested it, though).
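
To illustrate the Spark side of that comparison, here is a minimal sketch using Spark's Java API (the path and the partition count are made up):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SmallFilesCoalesceSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("small-files-sketch"));
        // reading a directory of many small files yields one partition per file split
        JavaRDD<String> lines = sc.textFile("hdfs:///data/appendjob/*");
        // collapse them into a small, fixed number of partitions before further processing
        JavaRDD<String> merged = lines.coalesce(8);
        System.out.println(merged.count());
        sc.stop();
    }
}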



Re: HDFS append

Flavio Pompermaier
I didn't know about that difference! So Flink is very smart :)
Thanks for the explanation, Robert.

Re: HDFS append

rmetzger0
Hey Flavio,


With the recursive directory reading from FLINK-1307 now in, you can simulate an append behavior with Flink:

- You have a directory in HDFS where you put the files you want to append, e.g. hdfs:///data/appendjob/.
- Each time you want to append something, you run your job and let it create a new directory inside hdfs:///data/appendjob/, let's say hdfs:///data/appendjob/run-X/.
- Now you can instruct a job to read the full output by letting it recursively read hdfs:///data/appendjob/ (see the sketch right after this list).
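
Here is a minimal sketch of the reading side with the DataSet API, assuming the recursive.file.enumeration parameter that FLINK-1307 introduces (all paths are made up):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class ReadAllAppendsSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // tell the input format to also enumerate the nested run-X directories
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        DataSet<String> allRuns = env
                .readTextFile("hdfs:///data/appendjob/")
                .withParameters(parameters);

        allRuns.writeAsText("hdfs:///data/appendjob-merged/"); // hypothetical output path
        env.execute("read all appended runs");
    }
}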

I hope that helps.


Best,
Robert


Re: HDFS append

Flavio Pompermaier

Thanks a lot, Robert!
