how does SQL mode work with PopularPlaces example?

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

how does SQL mode work with PopularPlaces example?

James Yu
Hi,

I am following the PopularPlacesSQL example (http://training.data-artisans.com/exercises/popularPlacesSql.html), but I am unable to understand why the following statement will pickup events with START flag only.

"SELECT " +
"toCoords(cell), wstart, wend, isStart, popCnt " +
"FROM " +
"(SELECT " +
"cell, " +
"isStart, " +
"HOP_START(eventTime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS wstart, " +
"HOP_END(eventTime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS wend, " +
"COUNT(isStart) AS popCnt " +
"FROM " +
"(SELECT " +
"eventTime, " +
"isStart, " +
"CASE WHEN isStart THEN toCellId(startLon, startLat) ELSE toCellId(endLon, endLat) END AS cell " +
"FROM TaxiRides " +
"WHERE isInNYC(startLon, startLat) AND isInNYC(endLon, endLat)) " +
"GROUP BY cell, isStart, HOP(eventTime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)) " +
"WHERE popCnt > 20"

Since we can update state in processElement when we do it with low level ProcessFunction, how does SQL rule out the un-paired events?


This is a UTF-8 formatted mail
-----------------------------------------------
James C.-C.Yu
+886988713275
Reply | Threaded
Open this post in threaded view
|

Re: how does SQL mode work with PopularPlaces example?

Fabian Hueske-2
Hi James,

the exercise does not require to filter on pickup events. It says:

"This is done by counting every five minutes the number of taxi rides that started and ended in the same area within the last 15 minutes. Arrival and departure locations should be separately counted."

That is achieved by including isStart in the GROUP BY fields.

If you would only like to count pickup locations, you can just add a predicate "WHERE isStart".
Regarding the question of state cleanup, the query does not require to join start and end events. Both are separately counted.
To join start and end events, you can use a time-windowed join [1] (since Flink 1.4) which will automatically remove rows from the state that fell out of the join window.

Best, Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/table/sql.html#joins


2018-03-22 2:34 GMT+01:00 James Yu <[hidden email]>:
Hi,

I am following the PopularPlacesSQL example (http://training.data-artisans.com/exercises/popularPlacesSql.html), but I am unable to understand why the following statement will pickup events with START flag only.

"SELECT " +
"toCoords(cell), wstart, wend, isStart, popCnt " +
"FROM " +
"(SELECT " +
"cell, " +
"isStart, " +
"HOP_START(eventTime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS wstart, " +
"HOP_END(eventTime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS wend, " +
"COUNT(isStart) AS popCnt " +
"FROM " +
"(SELECT " +
"eventTime, " +
"isStart, " +
"CASE WHEN isStart THEN toCellId(startLon, startLat) ELSE toCellId(endLon, endLat) END AS cell " +
"FROM TaxiRides " +
"WHERE isInNYC(startLon, startLat) AND isInNYC(endLon, endLat)) " +
"GROUP BY cell, isStart, HOP(eventTime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)) " +
"WHERE popCnt > 20"

Since we can update state in processElement when we do it with low level ProcessFunction, how does SQL rule out the un-paired events?


This is a UTF-8 formatted mail
-----------------------------------------------
James C.-C.Yu
<a href="tel:+886%20988%20713%20275" value="+886988713275" target="_blank">+886988713275