What is the equivalent of Spark RDD in Flink

What is the equivalent of Spark RDD in Flink

Sourav Mazumder
Hi,

I am new to Flink and am trying to understand some of its basics.

What is the equivalent of Spark's RDD in Flink? In my understanding the closest thing is the DataSet API, but I wanted to reconfirm.

Also, if I use the DataSet API to ingest a large volume of data (val lines : DataSet[String] = env.readTextFile(<some file path and name>)) that may not fit in a single slave node, will that data get automatically distributed across the memory of the other slave nodes?

Regards,
Sourav

Re: What is the equivalent of Spark RDD in Flink

Aljoscha Krettek
Hi Sourav,
You are right: in Flink, the equivalent of an RDD would be a DataSet (or a DataStream if you are working with the streaming API).

In contrast to Spark, a Flink job is executed lazily when ExecutionEnvironment.execute() is called. Only then does Flink build an executable program from the graph of transformations that was assembled by calling the transformation methods on a DataSet; that is why I called it lazy. The operations will also be automatically parallelized. The parallelism of operations can be configured in the cluster configuration (conf/flink-conf.yaml), on a per-job basis (ExecutionEnvironment.setParallelism(int)), or per operation, by calling setParallelism(int) on a DataSet.

(Above, you can always replace DataSet with DataStream; the same explanations hold.)

So, to get back to your question: yes, the operation of reading the file (or files in a directory) will be parallelized across several worker nodes based on the previously mentioned settings.
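To make this concrete, here is a minimal sketch in the Scala DataSet API (the paths and parallelism values are placeholders, not recommendations):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)                        // per-job default parallelism

val lines: DataSet[String] = env.readTextFile("/path/to/input")
val upper = lines
  .map(_.toUpperCase)
  .setParallelism(8)                         // per-operation override

upper.writeAsText("/path/to/output")

// Nothing has run up to this point; Flink has only built the plan of
// transformations. The job is actually executed here:
env.execute("parallelism sketch")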

Let us know if you need more information.

Cheers,
Aljoscha



Re: What is the equivalent of Spark RDD in Flink

Filip Łęczycki
Hi Aljoscha,

Sorry for going a little off-topic, but I wanted to clarify whether my understanding is right. You said that "Contrary to Spark, a Flink job is executed lazily". However, as I read in the available sources, for example http://spark.apache.org/docs/latest/programming-guide.html, chapter "RDD Operations": "The transformations are only computed when an action requires a result to be returned to the driver program." To my understanding, Spark implements the same lazy execution principle as Flink: the job is only executed when a data sink/action/execute is called, and before that only an execution plan is built. Is that correct, or are there other significant differences between the Spark and Flink lazy execution approaches that I failed to grasp?
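For reference, here is a minimal Spark sketch of what I mean (assuming an existing SparkContext named sc and a placeholder input path):

val lines = sc.textFile("/path/to/input")    // transformation: nothing runs yet
val lengths = lines.map(_.length)            // still lazy: only the plan grows
val total = lengths.reduce(_ + _)            // action: the job is submitted here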

Best regards,
Filip Łęczycki




Re: What is the equivalent of Spark RDD in Flink

Chiwan Park
Hi Filip,

Spark also executes jobs lazily, but it is slightly different from Flink: Flink can lazily execute a whole job in cases where Spark cannot. One example is an iterative job.

In Spark, each stage of the iteration is submitted, scheduled as a separate job, and executed, because an action is called at the end of each iteration. In Flink, even if the job contains an iteration, the user submits only one job, and the Flink cluster schedules and runs that job once.

Because of this difference, in Spark the user must decide more things, such as which RDDs to cache or uncache.

Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer on Stack Overflow [2] are helpful for understanding these differences. :)

[1]: http://www.slideshare.net/GyulaFra/flink-apachecon
[2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
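
As a minimal sketch of the structural difference (the update step is hypothetical; only the shape of the program matters), Flink's DataSet.iterate embeds the whole loop into a single job graph, while a Spark driver program would submit one job per iteration from a loop:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val initial: DataSet[Double] = env.fromElements(1.0)

// All 10 iterations become part of one job plan, scheduled once.
val result = initial.iterate(10) { current =>
  current.map(x => x / 2 + 1)                // hypothetical update step
}

result.print()                               // triggers execution of the single job

// Spark, for contrast (driver-side loop, one job submitted per iteration):
// var rdd = sc.parallelize(Seq(1.0))
// for (_ <- 1 to 10) {
//   rdd = rdd.map(x => x / 2 + 1).cache()
//   rdd.count()                             // action: schedules a new job
// }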


Regards,
Chiwan Park




Re: What is the equivalent of Spark RDD in Flink

Sourav Mazumder
Hi Aljoscha and Chiwan,

First of all, thanks for the inputs.

A couple of follow-ups:

1. Based on Chiwan's explanation and the links, my understanding is that a performance difference between Spark and Flink may arise during iterative computation (like building a model with a machine learning algorithm) across iterations, because of the overhead of starting a new set of tasks/operators. The other overheads would be the same, as both store the intermediate results in memory. Is this understanding correct?

2. In the case of Flink, what happens if a DataSet needs to hold more data than the total memory available across all the slave nodes? Will it serialize the data to the disks of the respective slave nodes by default?

Regards,
Sourav







Re: What is the equivalent of Spark RDD in Flink

Stephan Ewen
Concerning question (2):

DataSets in Flink are in most cases not materialized at all; they represent in-flight data as it is being streamed from one operation to the next (remember, Flink is streaming at its core). So even in a MapReduce-style program, the DataSet produced by the MapFunction never exists as a whole, but is continuously produced and streamed to the ReduceFunction.

The operator that executes the ReduceFunction materializes the data as part of its sorting operation. All materializing batch operations (sort / hash / cache / ...) can go out of core very reliably.
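
A minimal sketch of such a MapReduce-style program (placeholder paths; the comments mark what is streamed and what is materialized):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val pairs = env
  .readTextFile("/path/to/input")            // read in parallel splits
  .flatMap(_.toLowerCase.split("\\W+"))
  .map(w => (w, 1))                          // streamed record by record,
                                             // never materialized as a whole

val counts = pairs
  .groupBy(0)
  .sum(1)                                    // the grouping operator sorts here,
                                             // materializing and spilling to
                                             // disk if the data does not fit

counts.writeAsText("/path/to/output")
env.execute("pipelined word count sketch")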

Greetings,
Stephan









Re: What is the equivalent of Spark RDD in Flink

Chiwan Park
Regarding question 1:

Scheduling the job only once for an iterative computation is one of the factors behind the performance difference. Dongwon's slides [1] are helpful for understanding the other performance factors.

[1] http://flink-forward.org/?session=a-comparative-performance-evaluation-of-flink


Regards,
Chiwan Park