Speedup of Flink Applications


Speedup of Flink Applications

Hanna Prinz
Happy new year everyone :)

I'm currently working on a paper about Flink. I already got some recommendations on general papers with details about Flink, which helped me a lot. But now that I have read them, I'm interested in the speedup capabilities provided by the Flink framework: how "far" can it scale efficiently?

Amdahl's law states that parallelization is only efficient as long as the non-parallelizable part of the processing (time for communication between the nodes, etc.) doesn't "eat up" the speed gains of parallelization (= parallel slowdown).
Of course, the communication overhead is mostly caused by the application's implementation, but the framework's specific solution for communication between the nodes has a noticeable effect as well.
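As a concrete illustration (not from the thread), Amdahl's law can be computed directly; here p is the parallelizable fraction of the work and n the number of nodes:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup with parallelizable fraction p on n workers: 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelizable, the speedup is capped at 1 / 0.05 = 20x,
# no matter how many nodes are added:
for n in (4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

The flattening of this curve is exactly the "parallel slowdown" regime: adding nodes past a certain point buys almost nothing, because the serial fraction dominates.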

After studying these papers, it looks like the achievable speedup of Flink is roughly equal to that of Spark, even though Flink's performance is better in many cases:
1. Spark versus Flink - Understanding Performance in Big Data Analytics Frameworks | https://hal.inria.fr/hal-01347638/document
2. Big Data Analytics on Cray XC Series DataWarp using Hadoop, Spark and Flink | https://cug.org/proceedings/cug2016_proceedings/includes/files/pap141.pdf
3. Thrill - High-Performance Algorithmic Distributed Batch Data Processing with C++ | https://panthema.net/2016/0816-Thrill-High-Performance-Algorithmic-Distributed-Batch-Data-Processing-with-CPP/1608.05634v1.pdf

Does someone have …
… more information (or data) on speedup of Flink applications? 
… experience (or data) with Flink in an extremely parallelized environment?
… detailed information on how the nodes communicate, especially when they are waiting for task results of one another?

Thank you very much for your time & answers!
Hanna
Re: Speedup of Flink Applications

Timur Shenkao
Hi, 
It seems your questions are quite abstract & theoretical. The answer is: it depends on several factors. Skewness in the data, data volume, reliability requirements, "fatness" of the servers, whether one performs lookups in other data sources, etc.
The papers you mentioned mean the following: under concrete & specific conditions, the researchers achieved their results. If they had changed some parameters slightly (increased the network's throughput, for example, or changed the garbage collector's options), the results would have been completely different.

On Tuesday, January 3, 2017, Hanna Prinz <[hidden email]> wrote:
Re: Speedup of Flink Applications

Fabian Hueske
Hi Hanna,

I assume you are asking about the possible speed up of batch analysis programs and not about streaming applications (please correct me if I'm wrong).

Timur raised very good points about data size and skew.

Given evenly distributed data (no skewed key distribution for a grouping or join operation) and sufficiently large data sets, Flink scales quite well.
By default, Flink uses pipelined shuffles. Depending on the parallelism, you need to adapt the number of network buffers (see taskmanager.network.numberOfBuffers [1]). When the required number of network buffers becomes too large, you can switch to batched shuffles (set ExecutionMode.BATCH on the ExecutionConfig of the ExecutionEnvironment).
Apart from that, very high parallelism in complex programs might suffer from the scheduling overhead of the individual tasks. Here the JobManager might become a bottleneck when assigning tasks to workers, so there might be an initial delay before a job starts processing.
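The buffer knob mentioned above lives in flink-conf.yaml; a sketch (the value is illustrative, not a recommendation — the Flink 1.x docs suggested roughly slots-per-TaskManager² × number-of-TaskManagers × 4, so check the documentation for your version):

```yaml
# flink-conf.yaml (illustrative value only)
# rule of thumb from the docs: #slots-per-TM^2 * #TMs * 4
taskmanager.network.numberOfBuffers: 8192
```

Switching to batched shuffles is done in the job itself, e.g. `env.getConfig().setExecutionMode(ExecutionMode.BATCH);` in the Java DataSet API.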

If your data is skewed or too small, scaling out doesn't help: either a single worker stays busy while all the others wait for it, or the overhead of distributing the work becomes too large.
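To make the skew point concrete, here is a small sketch (not from the thread; the key distribution is made up) of what hash-partitioning a hot-keyed dataset across workers does to the per-worker load:

```python
from collections import Counter

# Hypothetical key distribution: one "hot" key holds 90% of the records.
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
counts = Counter(keys)

workers = 10
load = [0] * workers
for key, count in counts.items():
    # hash-partitioning, as in a grouping or join shuffle
    load[hash(key) % workers] += count

# The worker that receives the hot key does at least 900 units of work,
# while a perfectly even split would give each worker only 100.
print(max(load), sum(load) // workers)
```

Since the job finishes only when the most loaded worker does, the runtime is dominated by max(load), and adding more workers cannot reduce it below the hot key's share.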

Hope this helps,


2017-01-03 15:44 GMT+01:00 Timur Shenkao <[hidden email]>:

Re: Speedup of Flink Applications

Hanna Prinz
Hey Fabian and Timur,

Thank you for your helpful answers.
Especially because I'm aware that there is no simple answer to this. Also, I've just started working with Flink, so I might not have understood everything yet :)

From the documentation on task scheduling, I assumed that the JobManager might be a bottleneck under extreme parallelization. As for the skewness of the data: I suppose that's a problem every large-scale data processing framework has, and there is not much to do about it besides improving the partitioning where possible.

I will look into the configuration you mentioned, @Fabian, and might get back to you with further questions later.

Cheers
Hanna

On 05.01.2017 at 11:32, Fabian Hueske <[hidden email]> wrote: