Hash tables - joins, cogroup, deltaIteration

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Hash tables - joins, cogroup, deltaIteration

Ovidiu-Cristian MARCU
Hi,

Can you please confirm if there is any update regarding the hash tables use cases, as in [1] it is specified that Hash tables are used in Joins and for the Solution set in iterations (pending work to use them for grouping/aggregations)?

I am interested in the pending work progress and also if you consider to add an implementation where Joins and Solution Set in delta iterations (and CoGroup) can rely on a hybrid implementation where the engine can use also disk if not enough memory available when working with these hash tables.
 

Thanks

Best,
Ovidiu
Reply | Threaded
Open this post in threaded view
|

Re: Hash tables - joins, cogroup, deltaIteration

Fabian Hueske-2
Hi Ovidiu,

Hash tables are currently used for joins (inner & outer) and the solution set of delta iterations.
There is a pending PR that implements a hash table for partial aggregations (combiner) [1] which should be added soon.

Joins (inner & outer) are already implemented as Hybrid Hash joins that go to disk if memory is insufficient.
The hash table used for solution sets does not support spilling. In case of not enough memory, writing and reading data to / from disk in *each* iteration would cause a significant performance hit, that possibly outweighs the benefits of delta iterations. The recommended solution for this is to add more machines / memory to the cluster.
I am not aware of anybody working on addressing the issue of non-spillable hash tables of solution sets.

2016-04-18 21:47 GMT+02:00 Ovidiu-Cristian MARCU <[hidden email]>:
Hi,

Can you please confirm if there is any update regarding the hash tables use cases, as in [1] it is specified that Hash tables are used in Joins and for the Solution set in iterations (pending work to use them for grouping/aggregations)?

I am interested in the pending work progress and also if you consider to add an implementation where Joins and Solution Set in delta iterations (and CoGroup) can rely on a hybrid implementation where the engine can use also disk if not enough memory available when working with these hash tables.
 

Thanks

Best,
Ovidiu