Hello,
Sorry if this has already been asked or is already in the docs; I did not find the answer.

Is there a way to read a given set of folders in a Flink batch job? Let's say we have one folder per hour of data, written by Flume, and we'd like to read only the last N hours (or any other pattern or arbitrary list of folders).

And while I'm at it, I have another question. Let's say that in my batch job I need to sequence two "phases", and the second phase needs the final result of the first one.
- Do I have to create one ExecutionEnvironment per phase and execute them one after the other?
- Can my TaskManagers send some data (other than counters) back to the JobManager, or do I have to use a file to store the result of phase one and read it in phase two?

Thanks in advance for your answers,

Gwenhaël
Hey Gwenhaël,
See here for recursive traversal of input paths: https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#recursive-traversal-of-the-input-path-directory

Regarding the phases: the best way to exchange data between batch jobs is via files. You can then execute two programs one after the other; the first one produces the files, which the second job uses as input.

– Ufuk
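(For reference, a minimal sketch of the recursive enumeration option described on that doc page, using the Java DataSet API. The class name and the hdfs:///data path are placeholders, not from the thread.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class RecursiveReadSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // enable recursive enumeration of files in nested input directories
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        // the flag is passed to the input format via withParameters()
        DataSet<String> lines = env.readTextFile("hdfs:///data")
                .withParameters(parameters);

        lines.print(); // print() also triggers execution of the batch program
    }
}

Note that this reads everything under /data, which is exactly why it does not by itself answer the "only the last N hours" part of the question.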
Hi and thanks, I'm not sure that recursive traversal is what I need.
Let's say I have the following dir tree:

/data/2016_03_21_13/<files>.gz
/data/2016_03_21_12/<files>.gz
/data/2016_03_21_11/<files>.gz
/data/2016_03_21_10/<files>.gz
/data/2016_03_21_09/<files>.gz
/data/2016_03_21_08/<files>.gz
/data/2016_03_21_07/<files>.gz

I want my DataSet to include (and nothing else):

/data/2016_03_21_13/*.gz
/data/2016_03_21_12/*.gz
/data/2016_03_21_11/*.gz

And I do not want to include any of the other folders (and their files). Can I create a DataSet that would only contain those folders?
Hi Gwenhael,

That is not possible right now. As a workaround, you could construct three DataSets by reading recursively from each directory and union them later. Alternatively, moving/linking the directories to a different location would also work.

I agree that it would be nice to be able to specify a pattern of files to include/exclude. I've filed a JIRA: https://issues.apache.org/jira/browse/FLINK-3677

Cheers,
Max

On Mon, Mar 21, 2016 at 1:51 PM, Gwenhael Pasquiers <[hidden email]> wrote:
> Hi and thanks, I'm not sure that recursive traversal is what I need.
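(A minimal sketch of the union workaround Max describes, using the folders from the example above. The class name is made up, and it assumes that readTextFile on a directory picks up the files it contains and that the .gz files are readable by the text input format.)

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LastThreeHoursSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // one DataSet per hourly folder we actually want to read
        DataSet<String> h13 = env.readTextFile("hdfs:///data/2016_03_21_13");
        DataSet<String> h12 = env.readTextFile("hdfs:///data/2016_03_21_12");
        DataSet<String> h11 = env.readTextFile("hdfs:///data/2016_03_21_11");

        // union them into a single DataSet covering only those three folders
        DataSet<String> lastThreeHours = h13.union(h12).union(h11);

        lastThreeHours.print(); // print() triggers execution
    }
}

The list of folders could just as well be built in a loop from the "last N hours" and the DataSets unioned as they are created.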