There are three more strange things about the pv/uv computation in Flink SQL.
As I described in my earlier email, I computed pv and uv using two methods, which I list below:
SELECT a, v,
       MAX(DATE_FORMAT(ts, 'yyyy-MM-dd HH:mm:00')) AS dt,
       COUNT(m2) AS pv,
       COUNT(DISTINCT m2) AS uv
FROM kafkaTable
GROUP BY TUMBLE(ts, INTERVAL '1' DAY), a, v;
And the result for one dimension is:

Here are the three questions:
1. With the same CPU, memory, and parallelism, the day-grouping solution is much faster than the 1-day window solution: the day-grouping solution took 1 hour to consume all the data,
while the 1-day window solution took 4 hours. Why is there such a difference?
2. The final results are not the same: the pv/uv of the day grouping is 7304086/7299878, but the pv/uv of the 1-day window is 7304352/7300144 (a difference of 266 in both pv and uv). I suspect that neither result is exact and both are approximate.
If so, how much accuracy is lost? What algorithm does COUNT(DISTINCT ...) use underneath?
3. As the picture of the 1-day window shows, there are many records for a=1, v=12.0.6.1, dt=2021-01-13 17:45:00. In my last mail, I noted that after the job started executing, the records kept changing and there was
only one record per dimension. Now, at the end, many records per dimension have popped up, which is weird.
Any advice would be greatly appreciated.
Yours sincerely,
Josh