Question about counting top k on streaming data

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Question about counting top k on streaming data

Tony Wei
Hi,

I know that there is an improvement in Blink SQL that can deal with the top k problem like SQL
showed below by maintaining an in-memory "heap" to store top k records. That is not a problem
when user's score will only grow up.
SELECT user_id, score
FROM (
  SELECT *,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
  FROM user_scores)
WHERE row_num <= 3

My question is how to deal with such top k problem when user's score will decrease as well.
Suppose there is a SQL like this.
SELECT user_id, score
FROM (
  SELECT *,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
  FROM (
  SELECT user_id, LAST_VAL(score) AS score
  FROM user_scores
  GROUP BY user_id))
WHERE row_num <= 3
`user_scores` is a dynamic table converted from a DataStream, `LAST_VAL` is a UDAF to get
the latest value. So, the user's score will increase but also decrease from time to time.

So, if we only maintain a heap to store top k elements, there will be a problem. For example, if
there is already three users: A, B, C with score: 4, 3, 2 stored in a top-3 heap. If the next record
is a D user with score 1, it will be dropped due to the score is less than A, B and C. However, if
the next record comes after that with an updated score 0 for user A. In reality, we know that
top-3 users will become B, C and D, but it is no chance to get user D back if using heap in this
case. Using heap works fine if it is running on batch mode because the users' score won't
change from time to time. 

In this case, I think it should fall back to store all users and their scores. Update top-k every time
when receive a new record. If the heap optimization won't work here in streaming mode, is there
any other optimization can apply in this case? It is not necessary to focus on SQL only. Any
improvement on DataStream is also welcome. Thank you.

Best Regards,
Tony Wei