Flink and Spark

Flink and Spark

Samarth Mailinglist
Hey folks, I have a noob question.

I have already looked through the archives and seen a couple of discussions about Spark and Flink.

I am familiar with Spark (the Python API, especially MLlib), and I see many similarities between Flink and Spark.

How does Flink distinguish itself from Spark?

Re: Flink and Spark

Márton Balassi
Dear Samarth,

Besides the discussions you have mentioned [1], I can recommend one of our recent presentations [2], especially the section distinguishing Flink (from slide 16).

It is generally a difficult question, as both systems are rapidly evolving, so any answer can become outdated quite fast. However, there are fundamental design features that are highly unlikely to change. For example, Spark uses "true" batch processing, meaning that intermediate results are materialized (mostly in memory) as RDDs. Flink's engine is internally more like a streaming engine, forwarding results to the next operator as soon as possible. The latter can yield performance benefits for more complex jobs. Flink also gives you a query optimizer, spills gracefully to disk when the system runs out of memory, and has some nice features around serialization. For performance numbers and more insight please check out the presentation [2], and do not hesitate to post a follow-up mail here if you come across anything unclear or surprising.
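A rough sketch of the difference in plain Python (hypothetical, not either system's actual API):

```python
# Hypothetical sketch, not Flink or Spark code: contrasting the two
# execution styles. A list comprehension materializes every intermediate
# result before the next stage starts, like Spark's RDDs; chained
# generators forward each record to the next operator immediately,
# like Flink's pipelined engine.

def batch_style(records):
    # Each stage completes fully and materializes its output
    # before the next stage begins.
    doubled = [x * 2 for x in records]        # whole intermediate list in memory
    filtered = [x for x in doubled if x > 4]  # second full intermediate list
    return filtered

def pipelined_style(records):
    # Generators: each record flows through both operators before the
    # next record is even read; no intermediate list is materialized.
    doubled = (x * 2 for x in records)
    filtered = (x for x in doubled if x > 4)
    return list(filtered)

print(batch_style([1, 2, 3, 4]))      # [6, 8]
print(pipelined_style([1, 2, 3, 4]))  # [6, 8]
```

Both plans compute the same result; the pipelined version simply never holds a full intermediate dataset at once.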


Best,

Marton

On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <[hidden email]> wrote:

Re: Flink and Spark

Samarth Mailinglist
Thank you for your answer. I have a couple of follow-up questions:
1. Does Flink support the 'exactly-once semantics' that Spark and Storm support?
2. (Related to 1) What happens when an error occurs during processing?
3. Is there a plan for adding machine learning support on top of Flink, say Alternating Least Squares or basic Naive Bayes?
4. When you say Flink manages itself, does that mean I don't have to fiddle with the number of partitions (Spark) or the number of mappers/reducers (Hadoop) to optimize performance? (In some cases this might be needed.)
5. How far along is the Python API? I don't see it documented on the website.

On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <[hidden email]> wrote:

Re: Flink and Spark

Gyula Fóra
Hey,

1-2. As for failure recovery, there is a difference in how Flink batch and streaming programs handle failures. The failed parts of batch jobs currently restart upon failure, but there is an ongoing effort on fine-grained fault tolerance, somewhat similar to Spark's lineage tracking. (Technically this is exactly-once semantics, but that notion is somewhat meaningless for batch jobs.)
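The idea behind lineage-based recovery can be sketched in a few lines of plain Python (hypothetical, neither Flink's nor Spark's code): instead of checkpointing every intermediate result, the system remembers how each dataset was derived and recomputes only what was lost.

```python
# Hypothetical sketch of lineage-based batch recovery. Each dataset
# records its parent and the transformation that produced it; a lost
# partition is rebuilt by replaying that chain from stable storage.

lineage = {
    "raw":      {"parent": None,      "fn": None},
    "squared":  {"parent": "raw",     "fn": lambda xs: [x * x for x in xs]},
    "filtered": {"parent": "squared", "fn": lambda xs: [x for x in xs if x > 10]},
}

source_data = [1, 2, 3, 4, 5]

def recompute(name):
    """Rebuild a dataset purely from its recorded lineage."""
    if lineage[name]["parent"] is None:
        return source_data                    # re-read from stable storage
    parent = recompute(lineage[name]["parent"])
    return lineage[name]["fn"](parent)

# Simulate losing the "filtered" result and recovering it:
recovered = recompute("filtered")
print(recovered)   # [16, 25]
```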

For streaming programs we are currently working on fault tolerance; we are hoping to support at-least-once processing guarantees in the 0.9 release. After that we will focus our research efforts on a high-performance implementation of exactly-once processing semantics, which is still a hard problem in streaming systems. Storm's Trident can only provide its exactly-once semantics at very low throughput, an issue we are trying hard to avoid; as some performance measurements show, our streaming system is in general capable of much higher throughput than Storm.
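To make the at-least-once vs. exactly-once distinction concrete, here is a hypothetical sketch (not Flink's implementation): under at-least-once, a source replays records from its last acknowledged position after a failure, so downstream operators may see some records twice; an idempotent sink can still recover exactly-once *effects*.

```python
# Hypothetical sketch of at-least-once delivery plus an idempotent sink.

def replay_after_failure(records, failure_at, restart_from):
    """Simulate a source that crashes mid-stream and replays from a checkpoint."""
    delivered = records[:failure_at]     # records processed before the crash
    delivered += records[restart_from:]  # records replayed after recovery
    return delivered

class IdempotentSink:
    """Deduplicates by record id, so replays do not change the final state."""
    def __init__(self):
        self.seen = {}
    def write(self, record_id, value):
        self.seen[record_id] = value     # overwriting is a no-op for duplicates
    def total(self):
        return sum(self.seen.values())

records = [(i, 10) for i in range(5)]    # (id, value) pairs
delivered = replay_after_failure(records, failure_at=4, restart_from=2)
print(len(delivered))                    # 7 records delivered (2 duplicates)

sink = IdempotentSink()
for rid, val in delivered:
    sink.write(rid, val)
print(sink.total())                      # 50: the duplicates had no effect
```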

3. There are already many ML algorithms implemented for Flink, but they are scattered all around. We are planning to collect them into a machine learning library soon. We are also implementing an adapter for Samoa, which will provide some streaming machine learning algorithms as well. The Samoa integration should be ready in January.

4. Flink carefully manages its memory use to avoid heap errors and to utilize memory as effectively as it can. The optimizer for batch programs also takes care of many optimization steps that the user would otherwise have to perform manually, like optimizing the order of transformations. There are of course parts of a program that still need to be tuned by hand for maximal performance, for example the parallelism settings of some operators.
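One classic example of such an optimization, sketched here in plain Python (hypothetical, not the Flink optimizer), is pushing a selective filter below an expensive transformation so that the costly operator runs on fewer records:

```python
# Hypothetical filter-pushdown sketch. Both plans produce the same
# result; the optimized plan invokes the expensive operator far less.

calls = {"naive": 0, "optimized": 0}

def expensive_transform(x, counter):
    calls[counter] += 1      # count how often the costly operator runs
    return x * x

def naive_plan(data):
    # User-written order: transform everything, then filter.
    transformed = [expensive_transform(x, "naive") for x in data]
    return [x for x in transformed if x > 50]

def optimized_plan(data):
    # Optimizer-rewritten order: apply the (rewritten) predicate first,
    # then transform only the surviving records.
    kept = [x for x in data if x * x > 50]
    return [expensive_transform(x, "optimized") for x in kept]

data = list(range(10))
assert naive_plan(data) == optimized_plan(data)   # identical results
print(calls)   # the naive plan ran the expensive operator 10 times, the optimized plan twice
```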

5. As for the status of the Python API, I personally cannot say very much; maybe someone can jump in and help with that question :)

Regards,
Gyula

On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <[hidden email]> wrote:

Re: Flink and Spark

Samarth Mailinglist
Thank you for the answers, folks.
Can anyone point me to an implementation of an ML algorithm on Flink?

On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <[hidden email]> wrote:

Re: Flink and Spark

Márton Balassi
Hey,

You can find some ML examples like LinearRegression [1, 2] or KMeans [3, 4] in the examples package, in both Java and Scala, as a quickstart.
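For a feel of what the KMeans example computes, here is a rough, non-distributed sketch of the classic Lloyd's iteration in plain Python (hypothetical, not the Flink example's code), on one-dimensional points:

```python
# Hypothetical k-means (Lloyd's) iteration: assign each point to its
# nearest centroid, then recompute each centroid as the mean of its
# assigned points. The Flink example parallelizes exactly these steps.

def assign(points, centroids):
    """Map each point to the index of its nearest centroid."""
    def nearest(p):
        return min(range(len(centroids)),
                   key=lambda i: (p - centroids[i]) ** 2)
    return [nearest(p) for p in points]

def update(points, assignments, k):
    """Recompute each centroid as the mean of its assigned points."""
    new_centroids = []
    for i in range(k):
        cluster = [p for p, a in zip(points, assignments) if a == i]
        new_centroids.append(sum(cluster) / len(cluster))
    return new_centroids

points = [1.0, 2.0, 8.0, 9.0]
centroids = [0.0, 10.0]
for _ in range(3):                       # a few fixed iterations
    centroids = update(points, assign(points, centroids), k=2)
print(centroids)                          # [1.5, 8.5]
```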


On Fri, Dec 26, 2014 at 7:31 AM, Samarth Mailinglist <[hidden email]> wrote:

Re: Flink and Spark

Samarth Mailinglist
Thanks a lot Márton and Gyula!

On Fri, Dec 26, 2014 at 2:42 PM, Márton Balassi <[hidden email]> wrote:

Re: Flink and Spark

Robert Metzger
For the Python API, there is a pending pull request: https://github.com/apache/incubator-flink/pull/202. It is still a work in progress, but feedback is, as always, appreciated.



On Fri, Dec 26, 2014 at 3:41 PM, Samarth Mailinglist <[hidden email]> wrote:

Re: Flink and Spark

Till Rohrmann
Hi Samarth,


On Fri, Dec 26, 2014 at 4:12 PM, Robert Metzger <[hidden email]> wrote: