Reading from multiple input files with fewer task slots

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Reading from multiple input files with fewer task slots

Pieter Hameete
Hello Flinkers!

I run into some strange behavior when reading from a folder of input files.

When the number of input files in the folder exceeds the number of task slots I noticed that the size of my datasets varies with each run. It seems as if the transformations don't wait for all input files to be read.

When I have equal or more task slots than there are files, there are no problems.

I'm using a custom input format. Could there be a problem with my custom input format, and if so what could I be forgetting?

Kind regards and thank you for your time!

Pieter
Reply | Threaded
Open this post in threaded view
|

Re: Reading from multiple input files with fewer task slots

Stephan Ewen
I assume this concerns the streaming API?

Can you share your program and/or the custom input format code?

On Mon, Oct 5, 2015 at 12:33 PM, Pieter Hameete <[hidden email]> wrote:
Hello Flinkers!

I run into some strange behavior when reading from a folder of input files.

When the number of input files in the folder exceeds the number of task slots I noticed that the size of my datasets varies with each run. It seems as if the transformations don't wait for all input files to be read.

When I have equal or more task slots than there are files, there are no problems.

I'm using a custom input format. Could there be a problem with my custom input format, and if so what could I be forgetting?

Kind regards and thank you for your time!

Pieter

Reply | Threaded
Open this post in threaded view
|

Re: Reading from multiple input files with fewer task slots

Pieter Hameete

2015-10-05 12:38 GMT+02:00 Stephan Ewen <[hidden email]>:
I assume this concerns the streaming API?

Can you share your program and/or the custom input format code?

On Mon, Oct 5, 2015 at 12:33 PM, Pieter Hameete <[hidden email]> wrote:
Hello Flinkers!

I run into some strange behavior when reading from a folder of input files.

When the number of input files in the folder exceeds the number of task slots I noticed that the size of my datasets varies with each run. It seems as if the transformations don't wait for all input files to be read.

When I have equal or more task slots than there are files, there are no problems.

I'm using a custom input format. Could there be a problem with my custom input format, and if so what could I be forgetting?

Kind regards and thank you for your time!

Pieter


Reply | Threaded
Open this post in threaded view
|

Re: Reading from multiple input files with fewer task slots

Stephan Ewen
If you have more files than task slots, then some tasks will get multiple files. That means that open() and close() are called multiple times on the input format.

Make sure that your input format tolerates that and does not get confused with lingering state (maybe create a new SimpleInputProjection as well)

On Mon, Oct 5, 2015 at 12:41 PM, Pieter Hameete <[hidden email]> wrote:

2015-10-05 12:38 GMT+02:00 Stephan Ewen <[hidden email]>:
I assume this concerns the streaming API?

Can you share your program and/or the custom input format code?

On Mon, Oct 5, 2015 at 12:33 PM, Pieter Hameete <[hidden email]> wrote:
Hello Flinkers!

I run into some strange behavior when reading from a folder of input files.

When the number of input files in the folder exceeds the number of task slots I noticed that the size of my datasets varies with each run. It seems as if the transformations don't wait for all input files to be read.

When I have equal or more task slots than there are files, there are no problems.

I'm using a custom input format. Could there be a problem with my custom input format, and if so what could I be forgetting?

Kind regards and thank you for your time!

Pieter



Reply | Threaded
Open this post in threaded view
|

Re: Reading from multiple input files with fewer task slots

Pieter Hameete
Hi Stephen,

it was not the SimpleInputProjection, because that is a stateless object. The boolean endReached was not reset upon opening a new file however, so for each consecutive file no records were parsed.

Thanks alot for your help!

- Pieter

2015-10-05 12:50 GMT+02:00 Stephan Ewen <[hidden email]>:
If you have more files than task slots, then some tasks will get multiple files. That means that open() and close() are called multiple times on the input format.

Make sure that your input format tolerates that and does not get confused with lingering state (maybe create a new SimpleInputProjection as well)

On Mon, Oct 5, 2015 at 12:41 PM, Pieter Hameete <[hidden email]> wrote:

2015-10-05 12:38 GMT+02:00 Stephan Ewen <[hidden email]>:
I assume this concerns the streaming API?

Can you share your program and/or the custom input format code?

On Mon, Oct 5, 2015 at 12:33 PM, Pieter Hameete <[hidden email]> wrote:
Hello Flinkers!

I run into some strange behavior when reading from a folder of input files.

When the number of input files in the folder exceeds the number of task slots I noticed that the size of my datasets varies with each run. It seems as if the transformations don't wait for all input files to be read.

When I have equal or more task slots than there are files, there are no problems.

I'm using a custom input format. Could there be a problem with my custom input format, and if so what could I be forgetting?

Kind regards and thank you for your time!

Pieter




Reply | Threaded
Open this post in threaded view
|

Re: Reading from multiple input files with fewer task slots

Stephan Ewen
Okay, nice to hear it works out!

On Mon, Oct 5, 2015 at 1:50 PM, Pieter Hameete <[hidden email]> wrote:
Hi Stephen,

it was not the SimpleInputProjection, because that is a stateless object. The boolean endReached was not reset upon opening a new file however, so for each consecutive file no records were parsed.

Thanks alot for your help!

- Pieter

2015-10-05 12:50 GMT+02:00 Stephan Ewen <[hidden email]>:
If you have more files than task slots, then some tasks will get multiple files. That means that open() and close() are called multiple times on the input format.

Make sure that your input format tolerates that and does not get confused with lingering state (maybe create a new SimpleInputProjection as well)

On Mon, Oct 5, 2015 at 12:41 PM, Pieter Hameete <[hidden email]> wrote:

2015-10-05 12:38 GMT+02:00 Stephan Ewen <[hidden email]>:
I assume this concerns the streaming API?

Can you share your program and/or the custom input format code?

On Mon, Oct 5, 2015 at 12:33 PM, Pieter Hameete <[hidden email]> wrote:
Hello Flinkers!

I run into some strange behavior when reading from a folder of input files.

When the number of input files in the folder exceeds the number of task slots I noticed that the size of my datasets varies with each run. It seems as if the transformations don't wait for all input files to be read.

When I have equal or more task slots than there are files, there are no problems.

I'm using a custom input format. Could there be a problem with my custom input format, and if so what could I be forgetting?

Kind regards and thank you for your time!

Pieter