Re: Map and Reduce cycle
Posted by Fabian Hueske on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Map-and-Reduce-cycle-tp2128p2130.html
Hi Bill,
Flink uses pipelined data shipping by default. In a program like source -> map -> reduce -> sink, the mapper immediately starts shuffling its output over the network to the reducer. The reducer collects that data and sorts it in batches as it arrives. Once the mapper is done and the sorting has finished, the sorted batches are merged and a single sorted stream is fed into the reduce function.
So, the shuffling and sorting happens while the mapper is running, but the reduce function can only be applied after the last mapper has finished.
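The overlap Fabian describes can be sketched as a simplified model (this is not Flink's managed-memory implementation; the function name and `batch_size` are made up for illustration): the reducer sorts fixed-size batches while records are still streaming in, and only the final merge has to wait for the last mapper.

```python
import heapq

def reducer_side_sort(record_stream, batch_size=4):
    """Sort incoming records batch-by-batch, then merge the sorted batches.

    Conceptual sketch only: sorting each batch overlaps with the mappers
    still producing data; the merge step runs after the stream ends.
    """
    batches = []
    batch = []
    for record in record_stream:           # arrives while mappers are running
        batch.append(record)
        if len(batch) == batch_size:
            batches.append(sorted(batch))  # sort each batch as it fills up
            batch = []
    if batch:
        batches.append(sorted(batch))
    # Only after the last mapper has finished can the batches be merged
    # into the single sorted stream that feeds the reduce function.
    return heapq.merge(*batches)

shuffled = [5, 1, 9, 3, 7, 2, 8, 6, 4]     # unordered mapper output
print(list(reducer_side_sort(shuffled)))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

This is why a pipelined shuffle hides most of the sort cost behind the map phase, while the reduce function itself still sees a fully sorted input.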
I am not aware of benchmarks that compare Flink's and Spark's in-memory operations, but this blog post compares the performance of sorting binary data in managed memory (as Flink does) with naive object sorting on the heap (Spark might do it differently!):
-->
http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
Fabian