|
This post was updated on .
Hi community,
Top-N query is result-updating, according to [docs](
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/topn/):
> The TopN query is *Result Updating*. Flink SQL will sort the input data
stream according to the order key, so if the top N records have been
changed, the changed ones will be sent as retraction/update records to
downstream.
It makes sense for streaming input. However, I feel it makes less sense for
batch input. Why? Because the intermediate result of Top-N is not useful
for a batch input.
For example, take a look at this input:
```
1,100,1
1,101,3
2,201,4
2,200,2
2,202,5
2,203,6
1,102,7
1,103,8
1,104,9
2,204,10
```
The first column being user_id, the second being movie_id, the third being
score (generated by some ML model). I want to tell what the top 3 movies
that each user_id should watch.
Since the input is batch and finite, what I need is really just:
1 -> 102,103,104
2 -> 202,203,204
However, using this [PyFlink program](
https://github.com/YikSanChan/pyflink-quickstart/blob/12b89341301c62072c4dfff4eecc39edad61ef9b/topk.py),
I get a ton of intermediate output:
```
+I(1,100)
-U(1,100)
+U(1,100,101)
+I(2,201)
-U(2,201)
+U(2,201,200)
-U(2,201,200)
+U(2,201,200,202)
-U(2,201,200,202)
+U(2,201,202)
-U(2,201,202)
+U(2,201,202,203)
-U(1,100,101)
+U(1,100,101,102)
-U(1,100,101,102)
+U(1,101,102)
-U(1,101,102)
+U(1,101,102,103)
-U(1,101,102,103)
+U(1,102,103)
-U(1,102,103)
+U(1,102,103,104)
-U(2,201,202,203)
+U(2,202,203)
-U(2,202,203)
+U(2,202,203,204)
```
And since it is result-updating, I am not able to use a file as the sink.
My question is: is there a way to generate top-n movies for each user, and
make sure the output is not result-updating?
Thank you!
Best,
Yik San
|