Hi all,
I was wondering about the heuristics for CombineHint:

Flink uses SORT by default, but the doc for HASH says that we should expect it to be faster if the number of keys is less than 1/10th of the number of records.

HASH should be faster if it is able to combine a lot of records, which happens if multiple events for the same key are present in a data chunk *that fits into a combine-hashtable* (cf. the handling in ReduceCombineDriver.java).

Now, if I have 10 billion events and 100 million keys, but only about 1 million records fit into a hashtable, the number of matches may be extremely low, so very few events get combined (of course, this is similar for SORT, as the sorter's memory is bounded, too).

Am I correct in assuming that the actual tradeoff is not only based on the ratio of #total records/#keys, but also on #total records/#records that fit into a single Sorter/Hashtable?

Thanks,
Urs

--
Urs Schönenberger - [hidden email]

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082
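PS: For concreteness, a minimal sketch of the kind of job I mean — the input data, key/value layout, and class name below are placeholders, not our actual pipeline:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class CombineHintSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input: (key, count) pairs standing in for the real events.
        DataSet<Tuple2<Long, Long>> events = env.fromElements(
                Tuple2.of(1L, 1L), Tuple2.of(2L, 1L), Tuple2.of(1L, 1L));

        DataSet<Tuple2<Long, Long>> counts = events
                .groupBy(0)
                .reduce(new ReduceFunction<Tuple2<Long, Long>>() {
                    @Override
                    public Tuple2<Long, Long> reduce(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
                        return Tuple2.of(a.f0, a.f1 + b.f1);
                    }
                })
                // HASH makes the combiner pre-aggregate in an in-memory hashtable
                // (cf. ReduceCombineDriver); the default strategy is the sort-based combiner.
                .setCombineHint(CombineHint.HASH);

        counts.print();
    }
}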
Hi,
I would say that your assumption is correct and that the COMBINE strategy does in fact also depend on the ratio "#total records/#records that fit into a single Sorter/Hashtable".

I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do.

Best,
Aljoscha
Hi Urs,
Yes, the 1/10th ratio is just a very loose rule of thumb. I would suggest trying both the SORT and HASH strategies with a workload that is as similar as possible to your production workload (similar data, similar parallelism, etc.) and seeing which one is faster for your specific use case.

An important difference between the HASH and SORT strategies is that the sorting combiner stores the original input records, while the hash combiner stores only combined records. I.e., when an input record arrives whose key is already in the hashtable, that record won't consume additional memory, because it is combined right away. Therefore, for example, if you would like your combiner to not emit any records prematurely (i.e., combine everything possible without running out of memory), then with the SORT strategy you need combiner memory proportional to your input size, while with the HASH strategy you need combiner memory proportional only to the number of keys.

You are correct in that the performance depends very much on how many records fit into a single Sorter/Hashtable. However, I wrote #keys/#total records into the documentation because this is easier for a user to estimate, and this ratio being small correlates with the HASH strategy getting faster, as explained above.

Best,
Gábor
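PS: In case it helps, a rough sketch of such a side-by-side test. The input path and record types below are placeholders; the real comparison should read a sample that resembles the production data and run with production-like parallelism:

import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;

public class CombineHintComparison {
    public static void main(String[] args) throws Exception {
        // Run the same job once per strategy and compare runtimes.
        for (CombineHint hint : new CombineHint[]{CombineHint.SORT, CombineHint.HASH}) {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Placeholder source: replace with a sample that has a realistic
            // key distribution and record size.
            DataSet<Tuple2<Long, Long>> events = env
                    .readCsvFile("hdfs:///path/to/sample")
                    .types(Long.class, Long.class);

            events.groupBy(0)
                    .reduce(new ReduceFunction<Tuple2<Long, Long>>() {
                        @Override
                        public Tuple2<Long, Long> reduce(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
                            return Tuple2.of(a.f0, a.f1 + b.f1);
                        }
                    })
                    .setCombineHint(hint)
                    // Discard the output; we only care about the runtime here.
                    .output(new DiscardingOutputFormat<Tuple2<Long, Long>>());

            JobExecutionResult result = env.execute("combine-hint-" + hint);
            System.out.println(hint + ": " + result.getNetRuntime() + " ms");
        }
    }
}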
Hi Gábor,
thank you very much for your explanation; that makes a lot of sense.

Best regards,
Urs

--
Urs Schönenberger - [hidden email]

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082