Read a given list of HDFS folder

Read a given list of HDFS folder

Gwenhael Pasquiers
Hello,

Sorry if this has already been asked or is already in the docs; I did not find the answer.

Is there a way to read a given set of folders in a Flink batch job? Let's say we have one folder per hour of data, written by Flume, and we'd like to read only the last N hours (or any other pattern, or an arbitrary list of folders).

And while I'm at it, I have another question:

Let's say that in my batch job I need to sequence two "phases", and the second phase needs the final result of the first.
 - Do I have to create one ExecutionEnvironment per phase and execute them one after the other?
 - Can my TaskManagers send data (other than counters) back to the JobManager, or do I have to use a file to store the result of phase one and read it in phase two?

Thanks in advance for your answers,

Gwenhaël

Re: Read a given list of HDFS folder

Ufuk Celebi
Hey Gwenhaël,

see here for recursive traversal of input paths:
https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#recursive-traversal-of-the-input-path-directory

Regarding the phases: the best way to exchange data between batch jobs is via files. You can execute two programs one after the other; the first one produces the files, which the second job uses as input.

– Ufuk
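The file-based hand-off described above can be sketched as follows. This is a minimal stand-in using plain Python file I/O in place of real Flink sinks and sources; the function names and paths are hypothetical, chosen only to illustrate the pattern of two jobs communicating through a file:

```python
import json
import os
import tempfile

def run_phase_one(output_path):
    # Phase 1: compute a result and persist it to a file,
    # as the first Flink program would do with a file sink.
    result = {"total": sum(range(10))}  # 0 + 1 + ... + 9 = 45
    with open(output_path, "w") as f:
        json.dump(result, f)

def run_phase_two(input_path):
    # Phase 2: a separate program that reads phase one's
    # output file as its own input.
    with open(input_path) as f:
        result = json.load(f)
    return result["total"] * 2

# The two phases share nothing but the file path.
handoff = os.path.join(tempfile.mkdtemp(), "phase_one_result.json")
run_phase_one(handoff)
print(run_phase_two(handoff))  # → 90
```

In Flink terms, each `run_phase_*` function corresponds to building and executing one program; the second is submitted only after the first has finished writing.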




RE: Read a given list of HDFS folder

Gwenhael Pasquiers
Hi, and thanks. I'm not sure that recursive traversal is what I need.

Let's say I have the following dir tree:

/data/2016_03_21_13/<files>.gz
/data/2016_03_21_12/<files>.gz
/data/2016_03_21_11/<files>.gz
/data/2016_03_21_10/<files>.gz
/data/2016_03_21_09/<files>.gz
/data/2016_03_21_08/<files>.gz
/data/2016_03_21_07/<files>.gz


I want my DataSet to include (and nothing else):

/data/2016_03_21_13/*.gz
/data/2016_03_21_12/*.gz
/data/2016_03_21_11/*.gz

And I do not want to include any of the other folders (or their files).

Can I create a DataSet that contains only those folders?


Re: Read a given list of HDFS folder

Maximilian Michels
Hi Gwenhael,

That is not possible right now. As a workaround, you could construct three DataSets, one reading recursively from each directory, and union them afterwards. Alternatively, moving/linking the directories to a different location would also work.

I agree that it would be nice to specify a pattern of files to include/exclude. I've filed a JIRA: https://issues.apache.org/jira/browse/FLINK-3677

Cheers,
Max
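Until such a filter exists, the list of per-hour input paths can be computed on the client side before building the job. A small sketch, assuming the `/data/YYYY_MM_DD_HH` naming scheme from the directory tree above (the helper name is made up for illustration):

```python
from datetime import datetime, timedelta

def last_n_hour_dirs(base, latest, n):
    """Return the n hourly folders ending at `latest`, newest first."""
    return [
        "{}/{}".format(base, (latest - timedelta(hours=i)).strftime("%Y_%m_%d_%H"))
        for i in range(n)
    ]

dirs = last_n_hour_dirs("/data", datetime(2016, 3, 21, 13), 3)
print(dirs)
# → ['/data/2016_03_21_13', '/data/2016_03_21_12', '/data/2016_03_21_11']
```

Each resulting path would then back one DataSet (e.g. one `readTextFile` source per folder), and those DataSets would be unioned into the single input, per the workaround above.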

