Hi all,
I was wondering about the heuristics for CombineHint:

Flink uses SORT by default, but the doc for HASH says that we should expect it to be faster if the number of keys is less than 1/10th of the number of records.

HASH should be faster if it is able to combine a lot of records, which happens if multiple events for the same key are present in a data chunk *that fits into a combine-hashtable* (cf. the handling in ReduceCombineDriver.java).

Now, if I have 10 billion events and 100 million keys, but only about 1 million records fit into a hashtable, the number of matches may be extremely low, so very few events get combined (of course, this is similar for SORT, as the sorter's memory is bounded, too).

Am I correct in assuming that the actual tradeoff is not only based on the ratio of #total records/#keys, but also on #total records/#records that fit into a single Sorter/Hashtable?

Thanks,
Urs

--
Urs Schönenberger - [hidden email]

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082
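PS: For concreteness, a minimal sketch of the kind of job I mean — the input data, key/value layout, and class name below are placeholders, not our actual pipeline:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class CombineHintSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input: (key, count) pairs standing in for the real events.
        DataSet<Tuple2<Long, Long>> events = env.fromElements(
                Tuple2.of(1L, 1L), Tuple2.of(2L, 1L), Tuple2.of(1L, 1L));

        DataSet<Tuple2<Long, Long>> counts = events
                .groupBy(0)
                .reduce(new ReduceFunction<Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> reduce(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
                        return Tuple2.of(a.f0, a.f1 + b.f1);
                    }
                })
                // HASH makes the combiner pre-aggregate in an in-memory hashtable
                // (cf. ReduceCombineDriver); the default strategy is the sort-based combiner.
                .setCombineHint(CombineHint.HASH);

        counts.print();
    }
}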
Hi,
I would say that your assumption is correct and that the COMBINE strategy does in fact also depend on the ratio "#total records/#records that fit into a single Sorter/Hashtable".

I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do.

Best,
Aljoscha
Hi Urs,
Yes, the 1/10th ratio is just a very loose rule of thumb. I would suggest trying both the SORT and HASH strategies with a workload that is as similar as possible to your production workload (similar data, similar parallelism, etc.) and seeing which one is faster for your specific use case.

An important difference between the HASH and SORT strategies is that the sorting combiner stores the original input records, while the hash combiner stores only combined records. I.e., when an input record arrives whose key is already in the hashtable, that record won't consume additional memory, because it is combined right away. Therefore, for example, if you would like your combiner to not emit any records prematurely (i.e., combine everything possible without running out of memory), then with the SORT strategy you need combiner memory proportional to your input size, while with the HASH strategy you need combiner memory proportional only to the number of keys.

You are correct in that the performance depends very much on how many records fit into a single Sorter/Hashtable. However, I wrote #keys/#total records into the documentation because this is easier for a user to estimate, and this ratio being small correlates with the HASH strategy getting faster, as explained above.

Best,
Gábor
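PS: In case it helps, a rough sketch of such a side-by-side test. The input path and record types below are placeholders; the real comparison should read a sample that resembles the production data and run with production-like parallelism:

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

public class CombineHintComparison {
    public static void main(String[] args) throws Exception {
        // Run the same job once per strategy and compare runtimes.
        for (CombineHint hint : new CombineHint[]{CombineHint.SORT, CombineHint.HASH}) {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder source: replace with a sample that has a realistic
            // key distribution and record size.
            DataSet<Tuple2<Long, Long>> events = env
                    .readCsvFile("hdfs:///path/to/sample")
                    .types(Long.class, Long.class);

            events.groupBy(0)
                    .reduce(new ReduceFunction<Tuple2<Long, Long>>() {
                        @Override
                        public Tuple2<Long, Long> reduce(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
                            return Tuple2.of(a.f0, a.f1 + b.f1);
                        }
                    })
                    .setCombineHint(hint)
                    // Discard the output; we only care about the runtime here.
                    .output(new DiscardingOutputFormat<Tuple2<Long, Long>>());

            JobExecutionResult result = env.execute("combine-hint-" + hint);
            System.out.println(hint + ": " + result.getNetRuntime() + " ms");
        }
    }
}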
Hi Gábor,
thank you very much for your explanation; that makes a lot of sense.

Best regards,
Urs

--
Urs Schönenberger - [hidden email]

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082