Re: Union of multiple datasets vs Join

Posted by Fabian Hueske on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Union-of-multiple-datasets-vs-Join-tp578p579.html

Follow the first approach (union, then group by).
Joins are expensive; union comes for free.
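
To make the first approach (union followed by a group-by on field 0) concrete, here is a minimal sketch in plain Java collections. It only models the semantics and is not Flink code: the `Row` class is a hypothetical stand-in for a Tuple3, and the `union` and `groupByFirstField` helpers are names invented for this illustration. In Flink's DataSet API the corresponding calls would be `union` followed by `groupBy(0)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class UnionThenGroup {

    /** Hypothetical stand-in for a Tuple3<String, Integer, Double>. */
    static final class Row {
        final String f0;
        final int f1;
        final double f2;

        Row(String f0, int f1, double f2) {
            this.f0 = f0;
            this.f1 = f1;
            this.f2 = f2;
        }
    }

    /** Union: append all inputs; no records are compared or matched. */
    static List<Row> union(List<List<Row>> datasets) {
        List<Row> all = new ArrayList<>();
        for (List<Row> dataset : datasets) {
            all.addAll(dataset);
        }
        return all;
    }

    /** Group by field 0 in a single pass over the unioned records. */
    static Map<String, List<Row>> groupByFirstField(List<Row> rows) {
        Map<String, List<Row>> groups = new TreeMap<>();
        for (Row row : rows) {
            groups.computeIfAbsent(row.f0, key -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Three inputs with the same structure (illustrative data only).
        List<Row> a = Arrays.asList(new Row("x", 1, 1.0), new Row("y", 2, 2.0));
        List<Row> b = Arrays.asList(new Row("x", 3, 3.0));
        List<Row> c = Arrays.asList(new Row("z", 4, 4.0), new Row("y", 5, 5.0));

        Map<String, List<Row>> grouped = groupByFirstField(union(Arrays.asList(a, b, c)));
        for (Map.Entry<String, List<Row>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " tuple(s)");
        }
    }
}
```

In this model the union step does no matching at all, and the grouping touches each record once regardless of how many inputs there are, whereas a chain of pairwise joins would have to match records on the key at every step.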

Best, Fabian

2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi guys,

In my use case I have multiple Datasets with the same structure (e.g. Tuple3), and I want to produce an output Dataset containing all the Tuple3 records grouped by the first field (field 0).
I can obtain the same result either by taking the union of all the datasets and then grouping (the simplest implementation), or by joining them pairwise (((A->B)->C)->D ...). There may be other solutions I am not aware of. When should I use the first approach, and when the second? Could you help me understand the internals of the two approaches? I am always wary of chaining multiple joins when I don't know the exact size of the inputs.

Best,
Flavio