Parallel file read in LocalEnvironment


Parallel file read in LocalEnvironment

Flavio Pompermaier
Hi to all,

is there a way to split a single local file by line count (e.g. a split every 100 lines) in a LocalEnvironment to speed up a simple map function? It is not clear to me how local files (the files inside a directory when recursive=true) are handled by Flink. Is there any reference describing these internals?

Best,
Flavio

Re: Parallel file read in LocalEnvironment

Fabian Hueske-2
Hi Flavio,

it is not possible to split by line count, because that would require reading and parsing the file just to determine the splits.

Parallel processing of data sources depends on the input splits created by the InputFormat. Local files can be split just like files in HDFS. Usually, each file corresponds to at least one split, but multiple files can also be put into a single split if necessary. The logic for that goes into the InputFormat.createInputSplits() method.

Cheers, Fabian
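
As a rough illustration of what createInputSplits() produces, the sketch below asks a TextInputFormat for splits over a local file and prints the resulting byte ranges (the file path is hypothetical; note the splits are byte ranges, not line ranges):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

public class SplitInspection {
    public static void main(String[] args) throws Exception {
        // Hypothetical local file; adjust to your own input.
        TextInputFormat format = new TextInputFormat(new Path("file:///tmp/input.txt"));
        format.configure(new Configuration());

        // Ask for at least 4 splits; FileInputFormat divides each file
        // into byte ranges in its createInputSplits() implementation.
        FileInputSplit[] splits = format.createInputSplits(4);
        for (FileInputSplit split : splits) {
            System.out.println(split.getPath() + " start=" + split.getStart()
                    + " length=" + split.getLength());
        }
    }
}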


Re: Parallel file read in LocalEnvironment

Flavio Pompermaier
And what is the split policy of the FileInputFormat? Does it depend on the filesystem block size?
Is there a pointer to the various Flink input formats and a description of their internals?


Re: Parallel file read in LocalEnvironment

Fabian Hueske-2
I'm sorry, there is no such documentation.
You need to look at the code :-(


Re: Parallel file read in LocalEnvironment

Stephan Ewen
The split functionality is in the FileInputFormat, and the functionality that takes care of lines crossing split boundaries is in the DelimitedInputFormat.
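
Flink's exact code aside, the standard technique for handling lines that cross byte-range boundaries can be sketched roughly as follows (a simplified illustration, not the DelimitedInputFormat implementation): every split except the first skips forward to the first line starting at or after its offset, and every split finishes the last line it starts, even if that runs past its end, so each line is read by exactly one split.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of reading lines from a byte-range split
// (NOT Flink's actual code).
public class SplitLineReader {

    public static List<String> readSplit(String path, long start, long length)
            throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long end = start + length;
            if (start > 0) {
                // A line can start at 'start' only if the byte before it is
                // a newline, so scan from start - 1 to the next line start.
                file.seek(start - 1);
                int b;
                do {
                    b = file.read();
                } while (b != -1 && b != '\n');
            }
            // Own every line that *starts* before 'end'; the final
            // readLine() may run past 'end' to complete the record.
            while (file.getFilePointer() < end) {
                String line = file.readLine();
                if (line == null) {
                    break;
                }
                lines.add(line);
            }
        }
        return lines;
    }
}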


Re: Parallel file read in LocalEnvironment

Flavio Pompermaier
I've tried splitting my huge file by line count (using the bash command split -l) in 2 different ways:
  1. small line count (a huge number of small files)
  2. big line count (a small number of big files)
I can't understand why the time required to effectively start the job is more or less the same:
  • in 1. it takes a long time to fetch the file list (~50,000 files), and the split assigner assigns the splits quickly (but there are a lot of them)
  • in 2. Flink is fast to fetch the file list but extremely slow to generate the splits to assign
Initially I thought that Flink was eagerly materializing the lines somewhere, but neither memory nor disk usage increases.
What is going on underneath? Is this normal?

Thanks in advance,
Flavio

Re: Parallel file read in LocalEnvironment

Stephan Ewen
Late answer, sorry:

The splits are created in the JobManager, so the job submission should not be affected by that.

The assignment of splits to workers is very fast, so many splits with little data each is not very different from a few splits with large data.

Lines are never materialized and the operators do not work differently based on different numbers of splits.


Re: Parallel file read in LocalEnvironment

Flavio Pompermaier

So why does it take so long to start the job? Is it because the JobManager has to read all the lines of the input files before generating the splits?


Re: Parallel file read in LocalEnvironment

Stephan Ewen
The JobManager does not read the files, but it has to query HDFS for each file's metadata (size, blocks, block locations), which can take a while. There is a separate call to the HDFS NameNode for each file; the more files, the more metadata has to be collected.
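
To make that cost concrete, here is a rough sketch of the per-file metadata work that split creation performs, written against Flink's FileSystem API (the directory path is hypothetical, and this is not Flink's exact code). There is one block-location lookup per file, so ~50,000 files means ~50,000 such calls:

import org.apache.flink.core.fs.BlockLocation;
import org.apache.flink.core.fs.FileStatus;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

public class MetadataCost {
    public static void main(String[] args) throws Exception {
        // Hypothetical input directory.
        Path dir = new Path("file:///tmp/input-dir");
        FileSystem fs = dir.getFileSystem();

        long files = 0, blocks = 0;
        for (FileStatus file : fs.listStatus(dir)) {
            // One metadata call per file; on HDFS this is a NameNode round trip.
            BlockLocation[] locations =
                    fs.getFileBlockLocations(file, 0, file.getLen());
            files++;
            blocks += locations.length;
        }
        System.out.println(files + " files, " + blocks + " blocks");
    }
}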



Re: Parallel file read in LocalEnvironment

Flavio Pompermaier

In my test I was using the local filesystem (ext4).


Re: Parallel file read in LocalEnvironment

Stephan Ewen
Okay, let me take a step back and make sure I understand this right...

With many small files it takes longer to start the job, correct? How much time did it actually take and how many files did you have?



Re: Parallel file read in LocalEnvironment

Flavio Pompermaier

It was long ago, but if I remember correctly there were about 50k files.
