Re: Flink Cross Strategies
Posted by
gen-too on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Flink-Cross-Strategies-tp12516p12527.html
Hi Fabian,
thanks for your answer!
How does Flink select the cross strategy?
For a better understanding for my part maybe we can assume the
following example scenario: I have two DataSets consisting of 6000
and 4000 records (stored as files in HDFS) and I want to do the
cross operation. Lets say that I have a parallelism of 2, then how
is the general work flow?
Thanks in advance
Am 05.04.2017 um 09:56 schrieb Fabian
Hueske:
Hi,
first of all, Cross is a *very* expensive operation, if
you cannot ensure that one side is very small. If one
input fits into memory, it is usually better to use a
MapFunction with a broadcast set. If both sides can be
large, Cross will take a very long time.
That being said, the strategies work as follow:
- NESTEDLOOP_STREAMED_OUTER_FIRS
T: The first (or
left) input is the the outer side of a nested-loop. The
second (or right) input is buffered (potentially spilled
to disk). For each record of the outer input, we read and
combine all values of the spilled inner with the outer
record. Hence, the order of the outer side is preserved.
- NESTEDLOOP_BLOCKED_OUTER_FIRST: first (or left) input is
the the outer side of a nested-loop. The second (or right)
input is buffered (potentially spilled to disk). The outer
side is consumed in blocks of records. For each block of
outer records, the inner side is read and each record of the
inner side is combined with all outer records in the block.
This strategy destroys the sort order of the outer side.
The other strategies switch outer and inner side and are
symmetric.
The benefit of the blocked strategy is that we only iterate
once per block over the inner side and not for each individual
records as the streamed strategy does. However, the blocked
variant destroys the order of the outer side.
Best, Fabian