(DEPRECATED) Apache Flink User Mailing List archive.

Cogroup hints/performance

Classic

List

Threaded

2 messages Options

Newport, Billy

Cogroup hints/performance

We have a cogroup where sometimes we cogroup like this:

Dataset z = larger.coGroup(small).where…

The strategy is printed as hash on key and a sort asc on the other key. Which is which? Naively, we’d want to hash larger and sort the small? Or is that wrong?

What factors would impact the performance of the cogroup? We use cogroup to calculate a new set of records for a key from the previous calculated set with some modifications from (small). We’re temporally milestoning records using cogroup btw, that’s the use case.

Thanks

Billy Newport

Data Architecture, Goldman, Sachs & Co.
30 Hudson | 37th Floor | Jersey City, NJ

Tel: +1 (212) 8557773 | Cell: +1 (507) 254-0134
Email: [hidden email], KD2DKQ

Fabian Hueske-2

Re: Cogroup hints/performance

Hi Billy,

A CoGroup does not have any freedom in its execution strategy.

It requires that both inputs are partitioned on the grouping keys and are then performs a local sort-merge join, i.e, both inputs are sorted. Existing partitioning or sort orders can be reused.

Since there is only one execution strategy, there is not much you can do to optimize it without changing the operator.

If you can convert the CoGroup into an (Inner)Join or OuterJoin you have more degrees of freedom to optimize. In this case, Flink might be able to use a Broadcast/Forward shipping strategy (ideally keeping the large input local and broadcasting the small one) and using a HashJoin (small input being the build side, large input being the probe side).

Whether you can use a join depends on your application semantics. Maybe grouping all the smaller input and collecting all records for a key into a single record can help to switch from CoGroup to Join.

Hope this helps,

Fabian

2017-02-07 21:52 GMT+01:00 Newport, Billy <[hidden email]>:

We have a cogroup where sometimes we cogroup like this:

Dataset z = larger.coGroup(small).where…

The strategy is printed as hash on key and a sort asc on the other key. Which is which? Naively, we’d want to hash larger and sort the small? Or is that wrong?

What factors would impact the performance of the cogroup? We use cogroup to calculate a new set of records for a key from the previous calculated set with some modifications from (small). We’re temporally milestoning records using cogroup btw, that’s the use case.

Thanks

Billy Newport

Data Architecture, Goldman, Sachs & Co.
30 Hudson | 37th Floor | Jersey City, NJ

Tel: <a href="tel:(212)%20855-7773" value="+12128557773" target="_blank">+1 (212) 8557773 | Cell: <a href="tel:(507)%20254-0134" value="+15072540134" target="_blank">+1 (507) 254-0134
Email: [hidden email], KD2DKQ