We have a cogroup where sometimes we cogroup like this: Dataset z = larger.coGroup(small).where… The strategy is printed as hash on key and a sort asc on the other key. Which is which? Naively, we’d want to hash larger and sort the small? Or is that wrong?
What factors would impact the performance of the cogroup? We use cogroup to calculate a new set of records for a key from the previous calculated set with some modifications from (small). We’re temporally milestoning records using cogroup
btw, that’s the use case. Thanks Billy Newport Data Architecture, Goldman, Sachs & Co.
Tel: +1 (212) 8557773 | Cell: +1 (507) 254-0134
|
Hi Billy, A CoGroup does not have any freedom in its execution strategy.If you can convert the CoGroup into an (Inner)Join or OuterJoin you have more degrees of freedom to optimize. In this case, Flink might be able to use a Broadcast/Forward shipping strategy (ideally keeping the large input local and broadcasting the small one) and using a HashJoin (small input being the build side, large input being the probe side). Whether you can use a join depends on your application semantics. Maybe grouping all the smaller input and collecting all records for a key into a single record can help to switch from CoGroup to Join. Hope this helps, Fabian 2017-02-07 21:52 GMT+01:00 Newport, Billy <[hidden email]>:
|
Free forum by Nabble | Edit this page |