(DEPRECATED) Apache Flink User Mailing List archive.

How does flink read a DataSet?

Classic

List

Threaded

7 messages Options

Taher Koitawala

How does flink read a DataSet?

Hi All,

Just like Spark does Flink read a dataset and keep it in memory and keep applying transformations? Or all records read by Flink async parallel reads? Furthermore, how does Flink deal with

Regards,

Taher Koitawala

GS Lab Pune

+91 8407979163

Taher Koitawala

Re: How does flink read a DataSet?

Furthermore, how does Flink deal with Task Managers dying when it is using the DataSet API. Is checkpointing done on dataset too? Or the whole dataset has to re-read.

Regards,

Taher Koitawala

GS Lab Pune

+91 8407979163

On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <[hidden email]> wrote:

Hi All,
Just like Spark does Flink read a dataset and keep it in memory and keep applying transformations? Or all records read by Flink async parallel reads? Furthermore, how does Flink deal with

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

vino yang

Re: How does flink read a DataSet?

Hi Taher,

Stream processing and batch processing are very different. The principle of batch processing determines that it needs to process bulk data, such as memory-based sorting, join, and so on. So, in this case, it needs to wait for the relevant data to arrive before it is calculated, but this does not mean that the data is concentrated in one node, and the calculation is still distributed. Flink has corresponding optimization measures for the execution plan of batch processing. For the storage of large data sets, it uses a custom-managed memory mechanism (you can use more memory by applying extra-heap memory). Of course, the amount of data is still stored in the memory. It will spill to disk when not in use.

Regarding fault tolerance, the current checkpoint mechanism is only applicable to stream processing. Batch fault tolerance can be re-executed by directly playing back the complete data set. A TaskManager fails, Flink will kick it out of the cluster, and the Task running on it will fail, but the result of stream processing and batch Task failure is different. For stream processing, it triggers a restart of the entire job, which may only trigger a partial restart for batch processing.

Thanks, vino.

Taher Koitawala <[hidden email]> 于2018年9月12日周三上午1:50写道：

Furthermore, how does Flink deal with Task Managers dying when it is using the DataSet API. Is checkpointing done on dataset too? Or the whole dataset has to re-read.

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <[hidden email]> wrote:
Hi All,
Just like Spark does Flink read a dataset and keep it in memory and keep applying transformations? Or all records read by Flink async parallel reads? Furthermore, how does Flink deal with

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

Fabian Hueske-2

Re: How does flink read a DataSet?

Actually, some parts of Flink's batch engine are similar to streaming as well. If the data does not need to be sorted or put into a hash-table, the data is pipelined (like in many relational database systems).

For example, if you have a job that joins two inputs with a HashJoin, only the build side is marterialized in memory. If the build side fits in memory, the probe side if fully pipelined. If some parts of the build side need to be put on disk, the fraction of the probe side that would join with the spilled part is also written to disk. If the data needs to be sorted, Flink tries to do that in memory as well but can spill to disk if necessary. A job that only applies a filter or simple transformation would also be fully pipelined.

So it depends on the job and its execution plan whether data is stored in memory or not.

Best, Fabian

2018-09-12 2:34 GMT-04:00 vino yang <[hidden email]>:

Hi Taher,

Stream processing and batch processing are very different. The principle of batch processing determines that it needs to process bulk data, such as memory-based sorting, join, and so on. So, in this case, it needs to wait for the relevant data to arrive before it is calculated, but this does not mean that the data is concentrated in one node, and the calculation is still distributed. Flink has corresponding optimization measures for the execution plan of batch processing. For the storage of large data sets, it uses a custom-managed memory mechanism (you can use more memory by applying extra-heap memory). Of course, the amount of data is still stored in the memory. It will spill to disk when not in use.

Regarding fault tolerance, the current checkpoint mechanism is only applicable to stream processing. Batch fault tolerance can be re-executed by directly playing back the complete data set. A TaskManager fails, Flink will kick it out of the cluster, and the Task running on it will fail, but the result of stream processing and batch Task failure is different. For stream processing, it triggers a restart of the entire job, which may only trigger a partial restart for batch processing.

Thanks, vino.

Taher Koitawala <[hidden email]> 于2018年9月12日周三上午1:50写道：
Furthermore, how does Flink deal with Task Managers dying when it is using the DataSet API. Is checkpointing done on dataset too? Or the whole dataset has to re-read.

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <[hidden email]> wrote:
Hi All,
Just like Spark does Flink read a dataset and keep it in memory and keep applying transformations? Or all records read by Flink async parallel reads? Furthermore, how does Flink deal with

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

Taher Koitawala

Re: How does flink read a DataSet?

So flink TMs reads one line at a time from hdfs in parallel and keep filling it in memory and keep passing the records to the next operator? I just want to know how data comes in memory? How it is partition between TMs Is there a documentation i can refer how the reading is done and how data is pushed from operators to operators in both stream and batch

On Wed 12 Sep, 2018, 4:28 PM Fabian Hueske, <[hidden email]> wrote:

Actually, some parts of Flink's batch engine are similar to streaming as well. If the data does not need to be sorted or put into a hash-table, the data is pipelined (like in many relational database systems).
For example, if you have a job that joins two inputs with a HashJoin, only the build side is marterialized in memory. If the build side fits in memory, the probe side if fully pipelined. If some parts of the build side need to be put on disk, the fraction of the probe side that would join with the spilled part is also written to disk. If the data needs to be sorted, Flink tries to do that in memory as well but can spill to disk if necessary. A job that only applies a filter or simple transformation would also be fully pipelined.

So it depends on the job and its execution plan whether data is stored in memory or not.

Best, Fabian

2018-09-12 2:34 GMT-04:00 vino yang <[hidden email]>:
Hi Taher,

Stream processing and batch processing are very different. The principle of batch processing determines that it needs to process bulk data, such as memory-based sorting, join, and so on. So, in this case, it needs to wait for the relevant data to arrive before it is calculated, but this does not mean that the data is concentrated in one node, and the calculation is still distributed. Flink has corresponding optimization measures for the execution plan of batch processing. For the storage of large data sets, it uses a custom-managed memory mechanism (you can use more memory by applying extra-heap memory). Of course, the amount of data is still stored in the memory. It will spill to disk when not in use.

Regarding fault tolerance, the current checkpoint mechanism is only applicable to stream processing. Batch fault tolerance can be re-executed by directly playing back the complete data set. A TaskManager fails, Flink will kick it out of the cluster, and the Task running on it will fail, but the result of stream processing and batch Task failure is different. For stream processing, it triggers a restart of the entire job, which may only trigger a partial restart for batch processing.

Thanks, vino.

Taher Koitawala <[hidden email]> 于2018年9月12日周三上午1:50写道：
Furthermore, how does Flink deal with Task Managers dying when it is using the DataSet API. Is checkpointing done on dataset too? Or the whole dataset has to re-read.

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <[hidden email]> wrote:
Hi All,
Just like Spark does Flink read a dataset and keep it in memory and keep applying transformations? Or all records read by Flink async parallel reads? Furthermore, how does Flink deal with

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

Fabian Hueske-2

Re: How does flink read a DataSet?

The InputFormat interface is similar to Hadoop MapReduce's.

Data is emitted record-by-record, but InputFormats can read larger blocks for better efficiency (e.g., for ORC or Parquet files).

In general, Flink tries to push data forward as early as possible and avoids collecting records in memory unless necessary (e.g., for more efficient network transfer).

Partitioning is another story. There are two modes available. Pushing data eagerly to the next operator (batch and streaming) or collecting on the sender side (batch only).

2018-09-12 7:30 GMT-04:00 Taher Koitawala <[hidden email]>:

So flink TMs reads one line at a time from hdfs in parallel and keep filling it in memory and keep passing the records to the next operator? I just want to know how data comes in memory? How it is partition between TMs Is there a documentation i can refer how the reading is done and how data is pushed from operators to operators in both stream and batch

On Wed 12 Sep, 2018, 4:28 PM Fabian Hueske, <[hidden email]> wrote:
Actually, some parts of Flink's batch engine are similar to streaming as well. If the data does not need to be sorted or put into a hash-table, the data is pipelined (like in many relational database systems).
For example, if you have a job that joins two inputs with a HashJoin, only the build side is marterialized in memory. If the build side fits in memory, the probe side if fully pipelined. If some parts of the build side need to be put on disk, the fraction of the probe side that would join with the spilled part is also written to disk. If the data needs to be sorted, Flink tries to do that in memory as well but can spill to disk if necessary. A job that only applies a filter or simple transformation would also be fully pipelined.

So it depends on the job and its execution plan whether data is stored in memory or not.

Best, Fabian

2018-09-12 2:34 GMT-04:00 vino yang <[hidden email]>:
Hi Taher,

Stream processing and batch processing are very different. The principle of batch processing determines that it needs to process bulk data, such as memory-based sorting, join, and so on. So, in this case, it needs to wait for the relevant data to arrive before it is calculated, but this does not mean that the data is concentrated in one node, and the calculation is still distributed. Flink has corresponding optimization measures for the execution plan of batch processing. For the storage of large data sets, it uses a custom-managed memory mechanism (you can use more memory by applying extra-heap memory). Of course, the amount of data is still stored in the memory. It will spill to disk when not in use.

Regarding fault tolerance, the current checkpoint mechanism is only applicable to stream processing. Batch fault tolerance can be re-executed by directly playing back the complete data set. A TaskManager fails, Flink will kick it out of the cluster, and the Task running on it will fail, but the result of stream processing and batch Task failure is different. For stream processing, it triggers a restart of the entire job, which may only trigger a partial restart for batch processing.

Thanks, vino.

Taher Koitawala <[hidden email]> 于2018年9月12日周三上午1:50写道：
Furthermore, how does Flink deal with Task Managers dying when it is using the DataSet API. Is checkpointing done on dataset too? Or the whole dataset has to re-read.

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <[hidden email]> wrote:
Hi All,
Just like Spark does Flink read a dataset and keep it in memory and keep applying transformations? Or all records read by Flink async parallel reads? Furthermore, how does Flink deal with

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

Taher Koitawala

Re: How does flink read a DataSet?

Thanks a lot! For your explanation i am much clearer. However for my reference can you give me links of some documentations for flink Dataset and DataStream which clearly and in detail explain all the internals right from reading to processing etc etc. The flink landing page doesn't have in depth information about all this

On 12-Sep-2018 5:38 PM, "Fabian Hueske" <[hidden email]> wrote:

The InputFormat interface is similar to Hadoop MapReduce's.
Data is emitted record-by-record, but InputFormats can read larger blocks for better efficiency (e.g., for ORC or Parquet files).
In general, Flink tries to push data forward as early as possible and avoids collecting records in memory unless necessary (e.g., for more efficient network transfer).

Partitioning is another story. There are two modes available. Pushing data eagerly to the next operator (batch and streaming) or collecting on the sender side (batch only).

2018-09-12 7:30 GMT-04:00 Taher Koitawala <[hidden email]>:
So flink TMs reads one line at a time from hdfs in parallel and keep filling it in memory and keep passing the records to the next operator? I just want to know how data comes in memory? How it is partition between TMs Is there a documentation i can refer how the reading is done and how data is pushed from operators to operators in both stream and batch

On Wed 12 Sep, 2018, 4:28 PM Fabian Hueske, <[hidden email]> wrote:
Actually, some parts of Flink's batch engine are similar to streaming as well. If the data does not need to be sorted or put into a hash-table, the data is pipelined (like in many relational database systems).
For example, if you have a job that joins two inputs with a HashJoin, only the build side is marterialized in memory. If the build side fits in memory, the probe side if fully pipelined. If some parts of the build side need to be put on disk, the fraction of the probe side that would join with the spilled part is also written to disk. If the data needs to be sorted, Flink tries to do that in memory as well but can spill to disk if necessary. A job that only applies a filter or simple transformation would also be fully pipelined.

So it depends on the job and its execution plan whether data is stored in memory or not.

Best, Fabian

2018-09-12 2:34 GMT-04:00 vino yang <[hidden email]>:
Hi Taher,

Stream processing and batch processing are very different. The principle of batch processing determines that it needs to process bulk data, such as memory-based sorting, join, and so on. So, in this case, it needs to wait for the relevant data to arrive before it is calculated, but this does not mean that the data is concentrated in one node, and the calculation is still distributed. Flink has corresponding optimization measures for the execution plan of batch processing. For the storage of large data sets, it uses a custom-managed memory mechanism (you can use more memory by applying extra-heap memory). Of course, the amount of data is still stored in the memory. It will spill to disk when not in use.

Regarding fault tolerance, the current checkpoint mechanism is only applicable to stream processing. Batch fault tolerance can be re-executed by directly playing back the complete data set. A TaskManager fails, Flink will kick it out of the cluster, and the Task running on it will fail, but the result of stream processing and batch Task failure is different. For stream processing, it triggers a restart of the entire job, which may only trigger a partial restart for batch processing.

Thanks, vino.

Taher Koitawala <[hidden email]> 于2018年9月12日周三上午1:50写道：
Furthermore, how does Flink deal with Task Managers dying when it is using the DataSet API. Is checkpointing done on dataset too? Or the whole dataset has to re-read.

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163

On Tue, Sep 11, 2018 at 11:18 PM, Taher Koitawala <[hidden email]> wrote:
Hi All,
Just like Spark does Flink read a dataset and keep it in memory and keep applying transformations? Or all records read by Flink async parallel reads? Furthermore, how does Flink deal with

Regards,
Taher Koitawala
GS Lab Pune
+91 8407979163