(DEPRECATED) Apache Flink User Mailing List archive.

Re: Union of multiple datasets vs Join

Posted by Fabian Hueske-2 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Union-of-multiple-datasets-vs-Join-tp578p581.html

Union is just combining data from multiple sources into a single dataset.

That’s it. No memory, no disk involved.

In you case you have

input1.union(input2).groupBy(1).reduce(…)

This will translate into:

input1 -> repartition ->

read-both-inputs -> sort -> reduce

input2 -> repartition ->

So, in your case not even additional network transfer is involved, because both data sets would need to be partitioned for the reduce anyway.

Note, union in Flink has SQL union-all semantics, i.e., there is not removal of duplicates.

Cheers, Fabian

From: [hidden email]
Sent: ‎Monday‎, ‎22‎. ‎December‎, ‎2014 ‎14‎:‎32
To: [hidden email]

Ok thanks Fabian. I'd like just to know the internals of the union of multiple datasets (partitioning, distribution among server, memory/disk, etc..). Do you have any ref to this?

Thanks in advance,

Flavio

On Mon, Dec 22, 2014 at 12:46 PM, Fabian Hueske <[hidden email]> wrote:

Follow the first approach.
Joins are expensive, union comes for free.

Best, Fabian

2014-12-22 11:47 GMT+01:00 Flavio Pompermaier <[hidden email]>:
Hi guys,

In my use case I have multiple Datasets with the same structure (e.g. Tuple3) and I want to produce an output Dataset containing all Tuple3 grouped by the first field (0).
I can obtain the same results performing a union of all datasets and then a group by (simplest implementation) or join all of them pairwise (((A->B)->C)->D)..) or I don't know if there is any other solution. When should I use the first or the second approach? Could you help me in figuring out the internals of the two approaches? I always have some fear when using multiple joins when I don't know exactly their size..

Best,
Flavio