Does Top-N query have to be result-updating, even for batch input?

Yik San Chan
Hi community,

Top-N query is result-updating, according to the [docs](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/topn/):

> The TopN query is *Result Updating*. Flink SQL will sort the input data
stream according to the order key, so if the top N records have been
changed, the changed ones will be sent as retraction/update records to
downstream.

It makes sense for streaming input. However, I feel it makes less sense for
batch input. Why? Because the intermediate result of Top-N is not useful
for a batch input.

For example, take a look at this input:

```

1,100,1
1,101,3
2,201,4
2,200,2
2,202,5
2,203,6
1,102,7
1,103,8
1,104,9
2,204,10

```

The first column is user_id, the second is movie_id, and the third is
score (generated by some ML model). I want to find the top 3 movies
that each user_id should watch.

Since the input is batch and finite, what I need is really just:

1 -> 102,103,104
2 -> 202,203,204
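
As a reference point for the desired final result, here is a minimal plain-Python sketch (not Flink) that computes the top 3 movies per user from the sample rows. The function name `top_n_per_user` and the `ROWS` constant are made up for illustration.

```python
import heapq
from collections import defaultdict

# Sample rows from the post: (user_id, movie_id, score)
ROWS = [
    (1, 100, 1), (1, 101, 3), (2, 201, 4), (2, 200, 2), (2, 202, 5),
    (2, 203, 6), (1, 102, 7), (1, 103, 8), (1, 104, 9), (2, 204, 10),
]


def top_n_per_user(rows, n=3):
    """Return {user_id: [movie_id, ...]} holding the n highest-scored movies.

    Movies are listed in ascending score order, matching the lists above.
    """
    by_user = defaultdict(list)
    for user_id, movie_id, score in rows:
        by_user[user_id].append((score, movie_id))

    result = {}
    for user_id, scored in by_user.items():
        best = heapq.nlargest(n, scored)  # the n highest (score, movie_id) pairs
        result[user_id] = [movie_id for _, movie_id in sorted(best)]
    return result


print(top_n_per_user(ROWS))  # {1: [102, 103, 104], 2: [202, 203, 204]}
```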

However, using this [PyFlink program](https://github.com/YikSanChan/pyflink-quickstart/blob/12b89341301c62072c4dfff4eecc39edad61ef9b/topk.py),
I get a ton of intermediate output:

```
+I(1,100)
-U(1,100)
+U(1,100,101)
+I(2,201)
-U(2,201)
+U(2,201,200)
-U(2,201,200)
+U(2,201,200,202)
-U(2,201,200,202)
+U(2,201,202)
-U(2,201,202)
+U(2,201,202,203)
-U(1,100,101)
+U(1,100,101,102)
-U(1,100,101,102)
+U(1,101,102)
-U(1,101,102)
+U(1,101,102,103)
-U(1,101,102,103)
+U(1,102,103)
-U(1,102,103)
+U(1,102,103,104)
-U(2,201,202,203)
+U(2,202,203)
-U(2,202,203)
+U(2,202,203,204)
```

And since it is result-updating, I am not able to use a file as the sink.
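
To illustrate why a result-updating Top-N produces so many records, here is a simplified plain-Python model. It is only a sketch of the idea, not Flink's implementation, and it does not reproduce the exact records shown above: whenever a new row changes a user's current top 3, it retracts the old list (`-U`) and emits the new one (`+U`). The name `simulate_updating_top_n` is hypothetical.

```python
# Sample rows from the post: (user_id, movie_id, score)
ROWS = [
    (1, 100, 1), (1, 101, 3), (2, 201, 4), (2, 200, 2), (2, 202, 5),
    (2, 203, 6), (1, 102, 7), (1, 103, 8), (1, 104, 9), (2, 204, 10),
]


def simulate_updating_top_n(rows, n=3):
    """Yield (kind, user_id, movie_ids) each time a user's top n changes.

    kind is "+I" for the first result of a user, then "-U"/"+U" pairs.
    Simplified model only; real changelog semantics are richer than this.
    """
    state = {}  # user_id -> current top n as [(score, movie_id), ...], best first
    for user_id, movie_id, score in rows:
        old = state.get(user_id, [])
        new = sorted(old + [(score, movie_id)], reverse=True)[:n]
        if new == old:
            continue  # the new row did not make it into the top n
        if old:
            yield ("-U", user_id, [m for _, m in old])
            yield ("+U", user_id, [m for _, m in new])
        else:
            yield ("+I", user_id, [m for _, m in new])
        state[user_id] = new


for kind, user_id, movies in simulate_updating_top_n(ROWS):
    print(kind, user_id, movies)
```

Even in this reduced model, 10 input rows produce 18 output records, while only the last `+U` per user matters for a finite input.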

My question is: is there a way to generate top-n movies for each user, and
make sure the output is not result-updating?

Thank you!

Best,
Yik San

Re: Does Top-N query have to be result-updating, even for batch input?

Yik San Chan
There turns out to be an easy solution.

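```
from pyflink.table import BatchTableEnvironment, EnvironmentSettings

# Build the table environment in batch mode (Blink planner) instead of streaming mode.
env_settings = EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build()
t_env = BatchTableEnvironment.create(environment_settings=env_settings)
```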

Use batch mode instead; the Top-N query then emits its result only once, at
the end of the whole processing.
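
For completeness, here is a hedged end-to-end sketch of a batch-mode Top-N in PyFlink. It makes several assumptions: it uses the unified `TableEnvironment` rather than `BatchTableEnvironment`, the view name `ratings` and the function name `run_batch_top_n` are made up, and the query follows the `ROW_NUMBER()` pattern from the Flink Top-N docs rather than the linked program. Exact APIs may differ between PyFlink versions.

```python
# Sample rows from the original post: (user_id, movie_id, score)
ROWS = [
    (1, 100, 1), (1, 101, 3), (2, 201, 4), (2, 200, 2), (2, 202, 5),
    (2, 203, 6), (1, 102, 7), (1, 103, 8), (1, 104, 9), (2, 204, 10),
]

# Top-N pattern from the Flink SQL docs; the view name `ratings` is an assumption.
TOP_N_SQL = """
SELECT user_id, movie_id, score
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY score DESC) AS row_num
    FROM ratings
)
WHERE row_num <= 3
"""


def run_batch_top_n():
    """Run the Top-N query in batch mode and print the final rows."""
    # Imported lazily so this module still loads where PyFlink is not installed.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    settings = EnvironmentSettings.new_instance().in_batch_mode().build()
    t_env = TableEnvironment.create(settings)

    # Register the in-memory sample rows as a view the query can read.
    ratings = t_env.from_elements(ROWS, ["user_id", "movie_id", "score"])
    t_env.create_temporary_view("ratings", ratings)

    # In batch mode only the final rows are produced, with no -U/+U records.
    t_env.execute_sql(TOP_N_SQL).print()
```

Calling `run_batch_top_n()` requires a working PyFlink installation. Because the result is insert-only in batch mode, it should also be writable to a file sink.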