What is the basic difference between partitioning datasets by key or grouping them by key ?
Does it make a difference in terms of paralellism ? Thx |
Hi Patrick, I think (but I'm not 100% sure) its not a difference in what the engine does in the end, its more of an API thing. When you are grouping, you can perform operations such as reducing afterwards. On a partitioned dataset, you can do stuff like processing each partition in parallel, or sort them. The parallelism is independent of the partitioning or grouping. Usually there are more partitions than parallel instances, so each instance will take care of multiple partitions. On Thu, Feb 23, 2017 at 6:16 PM, Patrick Brunmayr <[hidden email]> wrote:
|
Hi Patrick, as Robert said, partitionBy() shuffles the data such that all records with the same key end up in the same partition. That's all it does.Moreover, groupBy() alone is not a complete operation but just "prepares" a following operation. It must be called with a reduce or combine operator. In contrast partitionBy() is by itself complete. So the difference between partitionBy() and groupBy() is more than just an API thing. Hope that helps, Fabian 2017-02-23 21:51 GMT+01:00 Robert Metzger <[hidden email]>:
|
Thank you for that answer. Helped me a lot 2017-02-23 22:10 GMT+01:00 Fabian Hueske <[hidden email]>:
|
Free forum by Nabble | Edit this page |