Read a given list of HDFS folder

Read a given list of HDFS folder

Gwenhael Pasquiers
Hello,

Sorry if this has already been asked or is already in the docs; I did not find the answer.

Is there a way to read a given set of folders in a Flink batch job? Let's say we have one folder per hour of data, written by Flume, and we'd like to read only the last N hours (or any other pattern, or an arbitrary list of folders).

And while I'm at it, I have another question:

Let's say that in my batch job I need to sequence two "phases", and the second phase needs the final result of the first.
 - Do I have to create one ExecutionEnvironment per phase and execute them one after the other?
 - Can my TaskManagers send data (other than counters) back to the JobManager, or do I have to use a file to store the result of phase one and read it in phase two?

Thanks in advance for your answers,

Gwenhaël

Re: Read a given list of HDFS folder

Ufuk Celebi
Hey Gwenhaël,

see here for recursive traversal of input paths:
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#recursive-traversal-of-the-input-path-directory

Regarding the phases: the best way to exchange data between batch jobs is via files. You can execute two programs one after the other; the first one produces the files, which the second job uses as input.

– Ufuk
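The file-based hand-off described above can be sketched as follows. This is a minimal stand-in using plain Python file I/O in place of real Flink sinks and sources; the function names and paths are hypothetical, chosen only to illustrate the pattern of two jobs communicating through a file:

```python
import json
import os
import tempfile

def run_phase_one(output_path):
    # Phase 1: compute a result and persist it to a file,
    # as the first Flink program would do with a file sink.
    result = {"total": sum(range(10))}  # 0 + 1 + ... + 9 = 45
    with open(output_path, "w") as f:
        json.dump(result, f)

def run_phase_two(input_path):
    # Phase 2: a separate program that reads phase one's
    # output file as its own input.
    with open(input_path) as f:
        result = json.load(f)
    return result["total"] * 2

# The two phases share nothing but the file path.
handoff = os.path.join(tempfile.mkdtemp(), "phase_one_result.json")
run_phase_one(handoff)
print(run_phase_two(handoff))  # → 90
```

In Flink terms, each `run_phase_*` function corresponds to building and executing one program; the second is submitted only after the first has finished writing.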




RE: Read a given list of HDFS folder

Gwenhael Pasquiers
Hi, and thanks. I'm not sure that recursive traversal is what I need.

Let's say I have the following dir tree:

/data/2016_03_21_13/<files>.gz
/data/2016_03_21_12/<files>.gz
/data/2016_03_21_11/<files>.gz
/data/2016_03_21_10/<files>.gz
/data/2016_03_21_09/<files>.gz
/data/2016_03_21_08/<files>.gz
/data/2016_03_21_07/<files>.gz


I want my DataSet to include (and nothing else):

/data/2016_03_21_13/*.gz
/data/2016_03_21_12/*.gz
/data/2016_03_21_11/*.gz

And I do not want to include any of the other folders (or their files).

Can I create a DataSet that contains only those folders?


Re: Read a given list of HDFS folder

Maximilian Michels
Hi Gwenhael,

That is not possible right now. As a workaround, you could construct three DataSets, one reading recursively from each directory, and union them afterwards. Alternatively, moving/linking the directories to a different location would also work.

I agree that it would be nice to specify a pattern of files to include/exclude. I've filed a JIRA: https://issues.apache.org/jira/browse/FLINK-3677

Cheers,
Max
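Until such a filter exists, the list of per-hour input paths can be computed on the client side before building the job. A small sketch, assuming the `/data/YYYY_MM_DD_HH` naming scheme from the directory tree above (the helper name is made up for illustration):

```python
from datetime import datetime, timedelta

def last_n_hour_dirs(base, latest, n):
    """Return the n hourly folders ending at `latest`, newest first."""
    return [
        "{}/{}".format(base, (latest - timedelta(hours=i)).strftime("%Y_%m_%d_%H"))
        for i in range(n)
    ]

dirs = last_n_hour_dirs("/data", datetime(2016, 3, 21, 13), 3)
print(dirs)
# → ['/data/2016_03_21_13', '/data/2016_03_21_12', '/data/2016_03_21_11']
```

Each resulting path would then back one DataSet (e.g. one `readTextFile` source per folder), and those DataSets would be unioned into the single input, per the workaround above.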

