Re: Union of multiple datasets vs Join
Posted by
Fabian Hueske-2 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Union-of-multiple-datasets-vs-Join-tp578p581.html
Union is just combining data from multiple sources into a single dataset.
That’s it. No memory, no disk involved.
In you case you have
input1.union(input2).groupBy(1).reduce(…)
This will translate into:
input1 -> repartition ->
read-both-inputs -> sort -> reduce
input2 -> repartition ->
So, in your case not even additional network transfer is involved, because both data sets would need to be partitioned for the reduce anyway.
Note, union in Flink has SQL union-all semantics, i.e., there is not removal of duplicates.
Cheers, Fabian
Ok thanks Fabian. I'd like just to know the internals of the union of multiple datasets (partitioning, distribution among server, memory/disk, etc..). Do you have any ref to this?
Thanks in advance,
Flavio