Input from nested directory structure

Vasiliki Kalavri
Hello all,

I want to run a Flink log processing job and my input is stored locally in a nested directory structure, like the following:

logs_dir/
|-----/machine1/
|-----------/january.log
|-----------/february.log
...
|-----/machine2/
...

etc.

When I pass "logs_dir" as the argument to readTextFile(), nothing is read and no exception or error is reported.
Copying the nested files (machine1/january.log, machine1/february.log, ...) into a single directory works fine, but is there a better way to do this?

Thank you!
V.

Re: Input from nested directory structure

Stephan Ewen
Hi!

Not right now. The input formats do not recursively enumerate files. In that respect, we followed the way Hadoop did it.

If there is interest, it should not be too hard to add an option to FileInputFormat that does a complete recursive traversal of the directory structure.

Greetings,
Stephan
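
The opt-in traversal described above could look roughly like the following sketch. It uses only the standard java.nio.file API, not Flink's FileInputFormat; the class InputFileEnumerator and the method enumerateFiles are hypothetical names chosen for illustration. With the flag off, files in subdirectories are skipped silently, which matches the behavior reported in the question; with the flag on, every subdirectory is visited.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not part of Flink. */
final class InputFileEnumerator {

    /**
     * Lists the regular files under dir in sorted order.
     * recursive == false: only files directly inside dir are returned,
     * subdirectories are skipped without any message.
     * recursive == true: subdirectories are descended into as well.
     */
    static List<Path> enumerateFiles(Path dir, boolean recursive) throws IOException {
        List<Path> files = new ArrayList<>();
        collect(dir, recursive, files);
        Collections.sort(files); // deterministic order for the caller
        return files;
    }

    private static void collect(Path dir, boolean recursive, List<Path> out) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    out.add(entry);
                } else if (recursive && Files.isDirectory(entry)) {
                    collect(entry, true, out);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // "logs_dir" is the directory name used in the question.
        Path root = Paths.get(args.length > 0 ? args[0] : "logs_dir");
        for (Path file : enumerateFiles(root, true)) {
            System.out.println(file);
        }
    }
}
```

Until such an option exists in the input format, a driver program could use a list like this as a workaround, for example by calling readTextFile() once per file and combining the results, instead of copying the files into one directory.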


On Tue, Dec 2, 2014 at 4:32 PM, Vasiliki Kalavri <[hidden email]> wrote:

Re: Input from nested directory structure

Vasiliki Kalavri
Hi,

thanks for replying!

It would certainly be useful for my use case, but not absolutely necessary. If you think other people might find it useful too, I can open an issue.
If not, I believe it would be nice to print a warning when a nested directory is given as the input path, because currently the files in the base directory are processed normally, but the nested ones are silently ignored.

Cheers,
V.
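
A check of the kind suggested above might be sketched as follows. This is only an illustration using java.nio.file; the class NestedInputCheck, its method names, and the wording of the message are assumptions, and nothing here reflects how Flink actually reports problems with input paths.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Flink. */
final class NestedInputCheck {

    /** Returns the directories located directly inside inputDir. */
    static List<Path> findSubdirectories(Path inputDir) throws IOException {
        List<Path> subdirs = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inputDir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    subdirs.add(entry);
                }
            }
        }
        return subdirs;
    }

    /**
     * Prints a warning to stderr if inputDir contains subdirectories whose
     * files would be skipped by a non-recursive read.
     * Returns true if a warning was printed.
     */
    static boolean warnIfNested(Path inputDir) throws IOException {
        List<Path> subdirs = findSubdirectories(inputDir);
        if (subdirs.isEmpty()) {
            return false;
        }
        System.err.println("WARNING: input path " + inputDir + " contains "
                + subdirs.size() + " nested directories whose files will not be read: "
                + subdirs);
        return true;
    }
}
```

Calling warnIfNested() on the input path before the job starts would make the skipped files visible instead of leaving them silently unread.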

On 2 December 2014 at 16:52, Stephan Ewen <[hidden email]> wrote:

Re: Input from nested directory structure

rmetzger0
+1 for adding such a feature. It should be very easy to implement (basically, extend the createInputSplits() method).

On Tue, Dec 2, 2014 at 5:22 PM, Vasiliki Kalavri <[hidden email]> wrote:

Re: Input from nested directory structure

Ufuk Celebi
+1 I find this useful as well.

On 04 Dec 2014, at 22:02, Robert Metzger <[hidden email]> wrote:
