(DEPRECATED) Apache Flink User Mailing List archive.

Benchmark results between Flink and Spark

Slim Baltagi

Benchmark results between Flink and Spark

Hi

Apache Flink outperforms Apache Spark in processing machine learning & graph algorithms and relational queries but not in batch processing!

The results were published in the proceedings of the 18th International Conference, Business Information Systems 2015, Poznań, Poland, June 24-26, 2015.

Thanks to our friend Google, Chapter 3: Evaluating New Approaches of Big Data Analytics Frameworks by
Norman Spangenberg, Martin Roth and Bogdan Franczyk is available for preview at http://goo.gl/WocQci
on pages 28-37.

Enjoy!

Slim Baltagi
http://www.SparkBigData.com

Stephan Ewen

Re: Benchmark results between Flink and Spark

Hi Slim!

Thank you for the link.

Unfortunately, I cannot access the contents. I always get a "connection closed" error.

Does anybody else experience something similar?

Stephan

On Sun, Jul 5, 2015 at 6:37 PM, Slim Baltagi <[hidden email]> wrote:

Hi

Apache Flink outperforms Apache Spark in processing machine learning & graph
algorithms and relational queries but not in batch processing!

The results were published in the proceedings of the 18th International
Conference, Business Information Systems 2015, Poznań, Poland, June 24-26,
2015.

Thanks to our friend Google, Chapter 3: Evaluating New Approaches of Big
Data Analytics Frameworks by
Norman Spangenberg, Martin Roth and Bogdan Franczyk is available for preview
at http://goo.gl/WocQci
at pages 28-37.

Enjoy!

Slim Baltagi
http://www.SparkBigData.com

--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Benchmark-results-between-Flink-and-Spark-tp1940.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

Fabian Hueske-2

Re: Benchmark results between Flink and Spark

Thanks for sharing, Slim!

I had a look at the report (except for two pages which were not available in the preview).

It compares four different tasks on a setup with 4 rather small nodes (8 cores, 16GB memory). I could not find which versions of Flink and Spark were compared.

The comparison tasks are:

1) WordCount for "batch processing"

2) KMeans for "Machine-learning"

3) PageRank for "Graph-processing"

4) Some kind of relational query (details probably in the two missing pages)

Flink outperforms Spark in all tasks except WordCount.

The results should not be taken too seriously, given the small number of nodes and the low number of different tasks (only one task represents each task category). K-Means and WordCount are certainly not representative of the very diverse categories "machine learning" and "batch processing". The same applies to relational processing, which could be anything from a single-table aggregation to a cascade of a dozen joins.

The results are very motivating though :-)
I hope to see more independent benchmarks in the future.

2015-07-05 19:02 GMT+02:00 Stephan Ewen <[hidden email]>:

Hi Slim!

Thank you for the link.

Unfortunately, I cannot access the contents. I always get a "connection closed" error.

Does anybody else experience something similar?

Stephan


Slim Baltagi

Re: Benchmark results between Flink and Spark

Hi Fabian

> I could not find which versions of Flink and Spark were compared.
According to Norman Spangenberg, one of the authors of the conference paper, the benchmark used Spark 1.2.0 and Flink 0.8.0.

I did ask him a few more questions about the benchmark between Flink and Spark.
I'll share the answers once Norman Spangenberg gets back to me.

Thanks

Slim Baltagi
Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
http://sparkbigdata.com/component/tags/tag/27-flink

Wang, Yanping

RE: Benchmark results between Flink and Spark

In reply to this post by Fabian Hueske-2

Hi,

I am new to the Flink community. I am interested in comparing Flink's features and performance with Spark's.

Does anyone know if there is any benchmark or test available for testing Spark performance on servers that have 32+ cores and 256+ GB of memory?

Thanks

-yanping

From: Fabian Hueske [mailto:[hidden email]]
Sent: Sunday, July 05, 2015 10:18 AM
To: [hidden email]
Subject: Re: Benchmark results between Flink and Spark

Thanks for sharing, Slim!

I had a look at the report (except for two pages which were not available in the preview).

It compares four different tasks on a setup with 4 rather small nodes (8 cores, 16GB memory). I could not find which versions of Flink and Spark were compared.

The comparison tasks are:

1) WordCount for "batch processing"

2) KMeans for "Machine-learning"

3) PageRank for "Graph-processing"

4) Some kind of relational query (details probably in the two missing pages)

Flink outperforms Spark in all tasks except WordCount.

The results are very motivating though :-)
I hope to see more independent benchmarks in the future.


hawin

Re: Benchmark results between Flink and Spark

In reply to this post by Slim Baltagi

Hi Slim and Fabian

Here is the Spark benchmark. https://amplab.cs.berkeley.edu/benchmark/

Do we have a similar report or comparison?

Thanks.

Best regards

Hawin

On Mon, Jul 6, 2015 at 6:32 AM, Slim Baltagi <[hidden email]> wrote:

Hi Fabian

> I could not find which versions of Flink and Spark were compared.
According to Norman Spangenberg, one of the authors of the conference paper,
the benchmark used *Spark* version *1.2.0* and *Flink* version *0.8.0*.

I did ask him a few more questions about the benchmark between Flink and
Spark.
I'll share the answers once Norman Spangenberg gets back to me.

Thanks

Slim Baltagi
Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
http://sparkbigdata.com/component/tags/tag/27-flink


Stephan Ewen

Re: Benchmark results between Flink and Spark

Hi Hawin!

The benchmark you refer to is a more or less pure SQL benchmark.

For systems that are designed for exactly the "beyond SQL" applications (streaming, iterative algorithms, UDFs, ...), this benchmark is probably not very meaningful, as it covers none of these areas.

Even in the SQL analytics space, this is not a very representative benchmark. The TPC benchmarks are probably more interesting there. They are designed with more input from industry and go through more design cycles.

Greetings,

Stephan

On Mon, Jul 6, 2015 at 7:00 PM, Hawin Jiang <[hidden email]> wrote:

Hi Slim and Fabian

Here is the Spark benchmark. https://amplab.cs.berkeley.edu/benchmark/
Do we have a similar report or comparison?
Thanks.

Best regards
Hawin


Slim Baltagi

Re: Benchmark results between Flink and Spark

In reply to this post by hawin

Hi Hawin

What you shared is not 'the Spark benchmark'.
This benchmark measures the response time of different tools, including Shark, on a handful of relational queries.
Shark development ended a year ago, on July 1, 2014, in favor of Spark SQL, which graduated from alpha on March 13, 2015.
I am not aware of any published benchmark between Spark and Flink by a third party except the one that I shared from a conference paper: http://goo.gl/WocQci
I hope this helps.

Slim Baltagi
Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
http://sparkbigdata.com/component/tags/tag/27-flink

Vasiliki Kalavri

Re: Benchmark results between Flink and Spark

Hi,

Apart from the amplab benchmark, you might also find [1] and [2] interesting. The first is a survey on existing benchmarks, while the second proposes one. However, they are also limited to SQL-like queries.

Regarding graph processing benchmarks, I recently came across Graphalytics [3]. The benchmark currently supports Giraph, GraphLab, GraphX, MapReduce and Neo4j. I hope we can add Gelly to this list soon!

Unfortunately, I'm not aware of any large-scale ML or streaming benchmarks.

Cheers,

Vasia.

[1]: http://arxiv.org/pdf/1402.5194.pdf

[2]: http://msrg.utoronto.ca/publications/pdf_files/2013/Ghazal13-BigBench:_Towards_an_Industry_Standa.pdf

[3]: http://event.cwi.nl/grades2015/07-capota.pdf

On 6 July 2015 at 19:03, Slim Baltagi <[hidden email]> wrote:

Hi Hawin

What you shared is not 'the Spark benchmark'.
This benchmark measures the response time of different tools, including Shark,
on a handful of relational queries.
Shark development ended a year ago, on July 1, 2014, in favor of Spark SQL,
which graduated from alpha on March 13, 2015.
I am not aware of any published benchmark between Spark and Flink by a third
party except the one that I shared from a conference paper:
http://goo.gl/WocQci
I hope this helps.

Slim Baltagi
Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
http://sparkbigdata.com/component/tags/tag/27-flink


Slim Baltagi

Re: Benchmark results between Flink and Spark

Hi

Vasia, thanks for sharing.
1. I would like to add a couple resources about BigBench, the Big Data benchmark suite that you are referring to:
https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
and also
http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/

2. BigDataBench is also an open source Big Data Benchmarking suite from both industry and academia. As a subset of BigDataBench, BigDataBench-DCA is China’s first industry-standard big data benchmark suite: http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/
It comes with real-world data sets and many workloads: TeraSort, WordCount, PageRank, K-means, NaiveBayes, Aggregation and Read/Write/Scan and also a tool that uses Hadoop, HBase and Mahout.
This might inspire the building of a Big Data benchmarking suite for Flink!

Regards,

Slim Baltagi
Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
http://sparkbigdata.com/component/tags/tag/27-flink

hawin

Re: Benchmark results between Flink and Spark

Hi Stephan

Yes, you are correct. It looks like TPCx-HS is an industry-standard benchmark for big data. But how do we get a Flink number for it?

I think it is also difficult to get a Spark performance number based on TPCx-HS.

If you know someone who can provide servers for performance testing, I would like to put in my best effort.

@Slim

That link is just for your reference. At least you can see the exact time each tool spent running those queries.

BigDataBench is a good guide for big data benchmarking. But how do we run these use cases on Flink and Spark to get performance numbers?

@Vasia

Thanks for sharing. If we can do some basic comparisons with Apache Spark, the red line below will go up fast.

Thanks.

On Mon, Jul 6, 2015 at 11:41 AM, Slim Baltagi <[hidden email]> wrote:

Hi

Vasia, thanks for sharing.
1. I would like to add a couple resources about *BigBench*, the Big Data
benchmark suite that you are referring to:
https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench
and also
http://blog.cloudera.com/blog/2014/11/bigbench-toward-an-industry-standard-benchmark-for-big-data-analytics/

2. *BigDataBench* is also an open source Big Data Benchmarking suite from
both industry and academia. As a subset of BigDataBench, BigDataBench-DCA
is China’s first industry-standard big data benchmark suite:
http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/
It comes with *real-world data sets* and *many workloads*: TeraSort,
WordCount, PageRank, K-means, NaiveBayes, Aggregation and Read/Write/Scan
and also a *tool* that uses Hadoop, HBase and Mahout.
This might inspire the building of a Big Data benchmarking suite for Flink!

Regards,

Slim Baltagi
Apache Flink Knowledge Base ( Now with over 300 categorized web resources!)
http://sparkbigdata.com/component/tags/tag/27-flink
