Data Locality in Flink

Data Locality in Flink

Soheil Pourbafrani
Hi

I want to know exactly how Flink reads data in both cases: a file in the local filesystem and a file on a distributed filesystem.

When reading from the local filesystem, I guess every line of the file will be read by a slot (according to the job parallelism) before the map logic is applied.

When reading from HDFS, I read this answer by Fabian Hueske and I want to know whether that is still Flink's strategy for reading from a distributed filesystem.

Thanks

Re: Data Locality in Flink

Fabian Hueske-2
Hi,

The method that I described in the SO answer is still implemented in Flink.
Flink tries to assign splits to tasks that run on local TaskManagers (TMs).
However, files are not split per line (that would be horribly inefficient) but into larger chunks, depending on the number of subtasks (and, in the case of HDFS, the file block size).
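The lazy, locality-preferring split assignment described above can be illustrated with a small simulation. This is a conceptual Python sketch, not Flink's actual implementation: each source task requests splits one at a time from a central assigner, which prefers a split with a replica on the requester's host and falls back to any remaining split otherwise. The class and method names are invented for illustration.

```python
class SplitAssigner:
    """Toy model of lazy split assignment: tasks pull splits one at a
    time, and the assigner prefers splits stored on the requester's host."""

    def __init__(self, splits):
        # splits: list of (split_id, set of hosts storing a replica)
        self.unassigned = list(splits)

    def next_split(self, task_host):
        # Prefer a split with a replica on the requesting task's host.
        for i, (split_id, hosts) in enumerate(self.unassigned):
            if task_host in hosts:
                return self.unassigned.pop(i)[0]
        # Otherwise hand out any remaining split (a remote read).
        if self.unassigned:
            return self.unassigned.pop(0)[0]
        return None  # all splits consumed

# Three HDFS-style blocks, each replicated on two hosts.
splits = [
    ("block-0", {"hostA", "hostB"}),
    ("block-1", {"hostB", "hostC"}),
    ("block-2", {"hostA", "hostC"}),
]
assigner = SplitAssigner(splits)
print(assigner.next_split("hostA"))  # block-0 (local read)
print(assigner.next_split("hostC"))  # block-1 (local read)
print(assigner.next_split("hostB"))  # block-2 (remote read for hostB)
```

Because assignment is pull-based, a fast task can keep requesting and drain the queue before slower tasks ask for their first split, which is exactly the race condition discussed below in this thread.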

Best, Fabian


Re: Data Locality in Flink

Flavio Pompermaier
Hi Fabian, I wasn't aware that "race-conditions may happen if your splits are very small as the first data source task might rapidly request and process all splits before the other source tasks do their first request". What happens exactly when a race condition arises? Is such an exception handled internally by Flink or not?

Re: Data Locality in Flink

Fabian Hueske-2
Hi Flavio,

These types of race conditions are not failure cases, so no exception is thrown.
It only means that a single source task reads all (or most of) the splits and no splits are left for the other tasks.
This can be a problem if each record represents a large amount of IO or an intensive computation, as the work might not be properly distributed. In that case you'd need to manually rebalance the partitions of the DataSet.
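What a manual rebalance does is redistribute records round-robin across the downstream parallel tasks, evening out a skewed split assignment. A conceptual Python sketch of that round-robin redistribution (not Flink code; the function is invented to illustrate the effect of a rebalance on skewed partitions):

```python
def rebalance(partitions, parallelism):
    """Round-robin records from skewed input partitions across
    `parallelism` output partitions, evening out the load."""
    out = [[] for _ in range(parallelism)]
    i = 0
    for part in partitions:
        for record in part:
            out[i % parallelism].append(record)
            i += 1
    return out

# Skewed input: one source task got almost all the splits.
skewed = [[1, 2, 3, 4, 5, 6, 7], [8], []]
balanced = rebalance(skewed, 3)
print([len(p) for p in balanced])  # [3, 3, 2]
```

The trade-off is a full redistribution of the data over the network, which is why it only pays off when the per-record work is expensive enough to outweigh the shuffle.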

Fabian

Re: Data Locality in Flink

Flavio Pompermaier
Thanks Fabian, that's clearer. Many times you don't know whether or not to rebalance a dataset, because it depends on the specific use case and data distribution.
An automatic way of choosing whether a DataSet could benefit from a rebalance would be very nice (at least for batch), but I fear this would be very hard to implement. Am I wrong?

Re: Data Locality in Flink

Fabian Hueske-2
Such a decision would require distribution statistics, preferably statistics on the actual data that would (or would not) need to be rebalanced.
That data is only available while a job is executing, and a component that changes a running program is very difficult to implement.
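As a minimal illustration of the kind of distribution statistic such a component would need, here is a hypothetical skew ratio over partition sizes (not a Flink metric; an automatic rebalance heuristic might trigger when it exceeds some threshold):

```python
def skew_ratio(partition_sizes):
    """Max over mean partition size: 1.0 means perfectly balanced,
    larger values mean one task carries disproportionate load."""
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean if mean else 1.0

sizes = [900, 30, 40, 30]  # one source task grabbed nearly all splits
print(skew_ratio(sizes))   # 3.6
```

Even with such a statistic, the hard part remains acting on it: the sizes are only known once the sources have run, at which point the plan is already executing.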

Best, Fabian

