Re: Hello, the performance of apply function after join
Posted by Fabian Hueske-2
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Hello-the-performance-of-apply-function-after-join-tp3835p3837.html
Hi Phil,
an apply method after a join runs pipelined with the join, i.e., it starts processing when the first join result is emitted and finishes after it handled the last join result.
Unless the logic in your apply function is terribly complex, this should be OK. If you do not specify an apply method, a default method is used which returns a Tuple2(left, right).
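To illustrate what "pipelined" means here, the following is a minimal plain-Java sketch, not Flink code: the class and method names are made up for illustration, and records are simply {key, value} string arrays. It shows the apply logic being called once per join result, at the moment the result is produced, and a stand-in for the default behaviour that just returns the pair.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Illustration only: none of these names are part of the Flink API.
class PipelinedJoinSketch {

    // Joins two lists of {key, value} records and hands every match to
    // `apply` as soon as it is found, instead of collecting all matches first.
    static <R> void join(List<String[]> left, List<String[]> right,
                         BiFunction<String[], String[], R> apply,
                         Consumer<R> out) {
        // Index the right side by key.
        Map<String, List<String[]>> table = new HashMap<>();
        for (String[] r : right) {
            table.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r);
        }
        // Probe with the left side; apply runs per emitted join result.
        for (String[] l : left) {
            for (String[] r : table.getOrDefault(l[0], Collections.emptyList())) {
                out.accept(apply.apply(l, r));
            }
        }
    }

    // Stand-in for the default behaviour: return the pair unchanged.
    static String[][] defaultApply(String[] l, String[] r) {
        return new String[][] {l, r};
    }
}
```

Because apply is invoked inside the probe loop, its cost is added per join result; nothing waits for the join to finish first.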
Regarding the join hints, there is no general rule for when to use joinWithHuge / joinWithTiny. It depends on the number of machines, machine specs, number of records, record size, etc.
If you use joinWithHuge/Tiny, the smaller side will be broadcast to every node and each parallel partition will hold the full relation in memory, i.e., if the smaller side is 10GB, you need at least 10GB for each task manager slot. So this should only be used if the smaller side is *really* small.
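As a rough back-of-the-envelope check of the broadcast cost, here is a small plain-Java sketch. The class and method names are hypothetical, and the calculation ignores hash table and serialization overhead, so the real requirement would be higher.

```java
// Illustration only: a simplified memory estimate, not a Flink utility.
class BroadcastMemorySketch {

    // Total bytes held across the cluster when the small side is replicated
    // into every parallel slot.
    static long totalBytes(long smallSideBytes, int parallelSlots) {
        return smallSideBytes * parallelSlots;
    }

    // True if a single slot can hold the whole broadcast side in memory.
    static boolean fitsInSlot(long smallSideBytes, long slotMemoryBytes) {
        return smallSideBytes <= slotMemoryBytes;
    }
}
```

For example, a 10GB small side on an assumed 8 slots means 80GB held in memory overall, and it does not fit at all if a slot has only 4GB.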
The join method also lets you specify more fine-grained hints, such as:
small.join(large, JoinHint.REPARTITION_HASH_SECOND)
which will execute the join by shuffling both inputs and building a hash table on the partition of the second input.
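The strategy described above (shuffle both inputs by key, then build a hash table on each partition of the second input and probe it with the first) can be sketched in plain Java as follows. This is a single-threaded simulation with made-up names and {key, value} int records, not how Flink implements it internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: simulates a repartition hash join in one thread.
class RepartitionHashSecondSketch {

    // Decides which partition a key is shuffled to.
    static int partitionOf(int key, int parallelism) {
        return Math.floorMod(Integer.hashCode(key), parallelism);
    }

    // "Shuffle": distribute {key, value} records to partitions by key.
    static List<List<int[]>> shuffle(List<int[]> input, int parallelism) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            parts.add(new ArrayList<>());
        }
        for (int[] rec : input) {
            parts.get(partitionOf(rec[0], parallelism)).add(rec);
        }
        return parts;
    }

    // Returns {key, firstValue, secondValue} for every matching pair.
    static List<int[]> join(List<int[]> first, List<int[]> second, int parallelism) {
        List<List<int[]>> firstParts = shuffle(first, parallelism);
        List<List<int[]>> secondParts = shuffle(second, parallelism);
        List<int[]> result = new ArrayList<>();
        for (int p = 0; p < parallelism; p++) {
            // Build side: hash table over this partition of the second input.
            Map<Integer, List<int[]>> table = new HashMap<>();
            for (int[] s : secondParts.get(p)) {
                table.computeIfAbsent(s[0], k -> new ArrayList<>()).add(s);
            }
            // Probe side: stream this partition of the first input.
            for (int[] f : firstParts.get(p)) {
                for (int[] s : table.getOrDefault(f[0], Collections.emptyList())) {
                    result.add(new int[] {f[0], f[1], s[1]});
                }
            }
        }
        return result;
    }
}
```

Note that each partition only holds its own share of the build side in memory, which is the main difference from the broadcast variant.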
If you want to optimize for performance, you should try both hints: REPARTITION_HASH_*small* and BROADCAST_HASH_*small*, where *small* stands for FIRST or SECOND, whichever input is the smaller one.
Best, Fabian