Trying to simulate the Split Distinct Aggregation optimizations from Table API

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Trying to simulate the Split Distinct Aggregation optimizations from Table API

Felipe Gutierrez
Hi,

I am trying to simulate the "Split Distinct Aggregation" [1] with the
data from Taxi Ride. I am using the following query:

SELECT dayOfTheYear, COUNT(DISTINCT driverId) FROM TaxiRide GROUP BY
dayOfTheYear

and I am analyzing the different methods for optimizing. So I started
using (1) no optimization, then the (2) "table.exec.mini-batch.size" =
TRUE with "table.optimizer.agg-phase-strategy" = "ONE_PHASE", then I
changed to (3) "table.optimizer.agg-phase-strategy" = "TWO_PHASE", and
finally I use the (4) "table.optimizer.distinct-agg.split.enabled" =
TRUE.

What does the sentence "COUNT DISTINCT is not good at reducing records
if the value of distinct key (i.e. user_id) is sparse." mean on the
website? In my case the distinct key is the driverId. So, should I
change the data source to have a lot of null values on the driverId
column?

I am asking because I tested this query with optimizations (1) and (2)
I got ~20K r/s on each operator (parallelism of 8) when I set the
workload to ~200K r/s. This is almost the total workload. Then I
changed only the optimization to (3, TWO_PHASE) and maximum throughput
reaches only 4K.

I think that the problem is in my data that the query with distinct is
consuming. So, how should I prepare the data to see the optimization
of split distinct take effect?

Thanks,
Felipe

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/tuning/streaming_aggregation_optimization.html#split-distinct-aggregation
--
-- Felipe Gutierrez
-- skype: felipe.o.gutierrez
-- https://felipeogutierrez.blogspot.com