Happy new year everyone :)
I’m currently working on a paper about Flink. I already got some recommendations on general papers with details about Flink, which helped me a lot. But now that I’ve read them, I’m further interested in the speedup capabilities provided by the Flink framework: how "far" can it scale efficiently? Amdahl’s law states that parallelization is only efficient as long as the non-parallelizable part of the processing (time for communication between the nodes, etc.) doesn’t "eat up" the speed gains of parallelization (= parallel slowdown). Of course, the communication overhead is mostly caused by the implementation, but the framework’s specific solution for communication between the nodes has a considerable effect as well. After studying these papers, it looks like, although Flink’s performance is better in many cases, the possible speedup is equal to the possible speedup of Spark.

1. Spark versus Flink - Understanding Performance in Big Data Analytics Frameworks | https://hal.inria.fr/hal-01347638/document

Does someone have …
… more information (or data) on the speedup of Flink applications?
… experience (or data) with Flink in an extremely parallelized environment?
… detailed information on how the nodes communicate, especially when they are waiting for task results of one another?

Thank you very much for your time & answers!
Hanna
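To make the Amdahl's law argument above concrete, here is a minimal sketch (plain Python, not Flink-specific; the 5% serial fraction is an illustrative assumption, not a measured Flink value) showing how a fixed non-parallelizable fraction caps the achievable speedup no matter how many workers are added:

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / n),
    where s is the non-parallelizable fraction and n the worker count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# With a hypothetical 5% serial fraction, speedup is capped at 1/0.05 = 20x:
for n in (8, 64, 512):
    print(n, round(amdahl_speedup(0.05, n), 2))
# 8   -> 5.93
# 64  -> 15.42
# 512 -> 19.28  (already close to the 20x ceiling)
```

This is why adding nodes beyond a certain point yields diminishing returns: the communication/coordination share (the "s" above) dominates, which is the parallel slowdown regime mentioned in the question.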
Hi,
It seems your questions are too abstract & theoretical. The answer is: it depends on several factors. Skewness in the data, data volume, reliability requirements, "fatness" of the servers, whether one performs look-ups in other data sources, etc. The papers you mentioned mean the following: under concrete & specific conditions, the researchers achieved their results. If they had changed some parameters slightly (increased the network's throughput, for example, or changed the garbage collector's options), the results would have been completely different.
On Tuesday, January 3, 2017, Hanna Prinz <[hidden email]> wrote:
Hi Hanna,

I assume you are asking about the possible speedup of batch analysis programs and not about streaming applications (please correct me if I'm wrong).

Given evenly distributed data (no skewed key distribution for a grouping or join operation) and sufficiently large data sets, Flink scales quite well. If your data is skewed or too small, scaling out doesn't help, because either a single worker will be busy working while all the others are waiting for it, or the overhead of distributing the work becomes too large.

Hope this helps,
Fabian
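The skew effect described above can be illustrated with a toy model (plain Python, numbers are made up for illustration): the wall-clock time of a parallel stage is dominated by its largest partition, so one hot key can erase almost all of the speedup.

```python
def stage_time(partition_sizes, records_per_second=1.0):
    """Wall-clock time of a parallel stage: every worker runs concurrently,
    so the stage finishes when the largest partition does."""
    return max(partition_sizes) / records_per_second

even = [100] * 10          # 1000 records spread evenly over 10 workers
skewed = [910] + [10] * 9  # same 1000 records, but one hot key

print(stage_time(even))    # 100.0 -> close to the ideal 10x speedup
print(stage_time(skewed))  # 910.0 -> barely faster than a single worker
```

In the skewed case, nine workers sit idle waiting for the tenth, which is exactly the situation Fabian describes: scaling out adds coordination overhead without reducing the time of the dominant partition.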
Hey Fabian and Timur,

Thank you for your helpful answers, especially because I'm aware that there is no simple answer to that. Also, I've just started to work with Flink, so I might not have understood everything yet :)

From the documentation on task scheduling, I assumed that the JobManager might be a bottleneck under extreme parallelization. As for the skewness of the data: I suppose that's a problem every large-scale data processing framework has, and there is not much to do about it besides improving the partitioning where possible.

I will look into the configuration you mentioned @Fabian and might get back to you with further questions later.

Cheers
Hanna