open multiple file from list of uri

Michele Bertoni
Hi everybody,
is there a way to specify a list of URIs (“hdfs://file1”, “hdfs://file2”, …) and open them as different files?
I know I can open the entire directory, but I want to be able to select a subset of the files in it.

thanks

Re: open multiple file from list of uri

Stephan Ewen
There are two ways to do that:

1) Create multiple sources and union them. This is easy, but probably a bit less efficient.

2) Override the FileInputFormat's createInputSplits method so that it goes over the union of the paths and creates a list of all the files and file splits that will be read.
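
A minimal sketch of option 1, assuming the Flink Scala DataSet API and plain text files. The object name, the helper `readAll`, and the example paths are hypothetical; only `readTextFile` and `union` are taken from Flink.

```scala
import org.apache.flink.api.scala._

object UnionOfSources {
  // Reads every path as its own text source and unions the resulting data sets.
  // `paths` must be non-empty, because reduce has nothing to combine otherwise.
  def readAll(env: ExecutionEnvironment, paths: Seq[String]): DataSet[String] =
    paths.map(p => env.readTextFile(p)).reduce(_ union _)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // Hypothetical paths, for illustration only.
    val paths = Seq("hdfs:///data/file1", "hdfs:///data/file2")
    // Assumes Flink 0.9 or later, where print() triggers execution itself.
    readAll(env, paths).print()
  }
}
```

Each path becomes a separate source in the plan, which is likely what makes this variant a bit less efficient for long lists of files.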

Stephan


(In reply to Michele Bertoni's message of Fri, Jun 26, 2015 at 12:12 PM.)

Re: open multiple file from list of uri

Michele Bertoni
Hi Stephan, thanks for answering.
Right now I am using an extension of DelimitedInputFormat. Is there a way to combine it with option 2?



(In reply to Stephan Ewen's message of 26 Jun 2015 at 12:17.)

Re: open multiple file from list of uri

Stephan Ewen
Sure, just override the "createInputSplits()" method. Call "super.createInputSplits()" for each of your file paths and then combine the results into one array that you return.

That should do it...

(In reply to Michele Bertoni's message of Fri, Jun 26, 2015 at 12:19 PM.)

Re: open multiple file from list of uri

Michele Bertoni
Got it!
I will try it, thanks! :)

What about writing a section on this in the programming guide?
I found a couple of topics about the readers on the mailing list, so it seems it could be helpful.


(In reply to Stephan Ewen's message of 26 Jun 2015 at 12:21.)

Re: open multiple file from list of uri

Stephan Ewen
Seems like a good idea to collect these questions.

Stack Overflow is also a good place for "useful tricks"...

(In reply to Michele Bertoni's message of Fri, Jun 26, 2015 at 12:25 PM.)

Re: open multiple file from list of uri

Michele Bertoni
Right!
Later I will post the question and quote your answer with the solution :)

(In reply to Stephan Ewen's message of 26 Jun 2015 at 12:27.)

Re: open multiple file from list of uri

Michele Bertoni
Hi Stephan, I started working on this today, but I am having a problem.

Can you describe the procedure in a little more detail?
I don’t understand how to give the input format the list of URIs, since it will try to put it in a Path variable.

createInputSplits() does not receive the path as an argument; it takes the path from that variable.


Thanks,
Michele


(In reply to Michele Bertoni's message of 26 Jun 2015 at 12:28.)

Re: open multiple file from list of uri

Stephan Ewen
For the approach that I outlined, you need to subclass the file input format.

In that subclass, you store the list of URIs (in a new variable), and override the "createInputSplits()" method.
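
A minimal sketch of such a subclass, assuming plain text input so that Flink's `TextInputFormat` (itself a `DelimitedInputFormat`) can serve as the base class. `MultiPathTextInputFormat` is a hypothetical name, and passing the first path to the superclass constructor is an assumption made here only so that the base class has a non-null file path.

```scala
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.{FileInputSplit, Path}

// Hypothetical class name; not part of Flink.
// `paths` must be non-empty and serializable (a List or Vector is),
// because the input format is shipped to the cluster.
class MultiPathTextInputFormat(paths: Seq[String])
    extends TextInputFormat(new Path(paths.head)) {

  // Create the splits for each stored path in turn and return them as one array.
  override def createInputSplits(minNumSplits: Int): Array[FileInputSplit] =
    paths.flatMap { p =>
      setFilePath(p) // point the base class at the current path
      super.createInputSplits(minNumSplits).toSeq
    }.toArray
}
```

A possible usage, under the same assumptions, is `env.readFile(new MultiPathTextInputFormat(paths), paths.head)`.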

Stephan

(In reply to Michele Bertoni's message of Tue, Jul 14, 2015 at 6:42 PM.)

Re: open multiple file from list of uri

Michele Bertoni
OK, thank you, I have solved it now!


The problem was in the call env.readFile(myInputFormat, path).

Now that the path is actually a list of paths, what should I pass to it?



I solved it this way:

env.readFile(new myDelimitedInputFormat(parser)(paths), paths.head)

Here paths.head gives readFile a URL that is just a “placeholder” and never seems to be used, while the custom input format takes care of creating the splits from the list of paths.

I tried it and it works.
Is this the correct way to do it? :)



FYI, createInputSplits() is implemented this way:

import org.apache.flink.core.fs.FileInputSplit

override def createInputSplits(minNumSplits: Int): Array[FileInputSplit] = {
  // Create the splits for each path in turn and concatenate them into one array.
  paths.flatMap { p =>
    super.setFilePath(p)
    super.createInputSplits(minNumSplits)
  }.toArray
}

where paths is a parameter of the input format constructor (as is the custom parser shown above).

Do you think it would be useful if I opened a Stack Overflow post about it (maybe with the custom parser too)?




cheers
michele


(In reply to Stephan Ewen's message of 14 Jul 2015 at 18:50.)

Re: open multiple file from list of uri

Stephan Ewen
If you want to work without the placeholder, simply do: "env.createInput(new myDelimitedInputFormat(parser)(paths))"

The "createInputSplits()" method looks good.

Greetings,
Stephan


(In reply to Michele Bertoni's message of Tue, Jul 14, 2015 at 11:42 PM.)

Re: open multiple file from list of uri

Michele Bertoni
Hmm, it doesn’t seem to work: it calls the configure() method, which checks whether filePath is null and throws an exception.
I actually set that field only in createInputSplits(), which runs some steps later.


(In reply to Stephan Ewen's message of 15 Jul 2015 at 13:16.)

Re: open multiple file from list of uri

Stephan Ewen
You are right, the implementation needs a placeholder here.
The placeholder can probably be a "fake path", like "file:///this/will/never/be/read/anyways", because you override the "createInputSplits()" method...
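
A small sketch of the placeholder idea, assuming the Flink Scala API. The object `PlaceholderPath` and the helper `read` are hypothetical names; the fake path is the one quoted above. The sketch only makes sense for an input format whose createInputSplits() ignores the configured path and builds its splits from its own list.

```scala
import org.apache.flink.api.common.io.FileInputFormat
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._
import scala.reflect.ClassTag

object PlaceholderPath {
  // Exists only so that the file path of the input format is not null.
  val Placeholder = "file:///this/will/never/be/read/anyways"

  // Registers the format as a source, using the placeholder as the nominal path.
  def read[T: ClassTag: TypeInformation](
      env: ExecutionEnvironment,
      format: FileInputFormat[T]): DataSet[T] =
    env.readFile(format, Placeholder)
}
```

With a helper like this, the call from earlier in the thread might become `PlaceholderPath.read(env, new myDelimitedInputFormat(parser)(paths))`.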

(In reply to Michele Bertoni's message of Thu, Jul 16, 2015 at 12:03 AM.)