Flink SQL API: Extra columns added from order by

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink SQL API: Extra columns added from order by

Morrisa Brenner

Hi Flink folks,


We have a custom date formatting function that we use to format the output of columns containing dates. Ideally what we want is to format the output in the select statement but be able to order by the underlying datetime (so that and output with formatted dates "February 2019" and "April 2019" is guaranteed to have the rows sorted in time order rather than alphabetical order).


When I go to add the unformatted column to the order by, however, that gets appended as an extra column to the select statement during the query planning process within Calcite. (In the order by parsing, it's considering this a different column from the one in the select statement.) When the group by column is different in the same way but there's no order by column, the extra column isn't added. I've included a couple of simple examples below.


Is this the intended behavior of the query planner? Does anyone know of a way around this without needing to change the formatting so that it makes the output dates correctly sortable?


Thanks for your help!


Morrisa




Example query and output with order by using formatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') 

ORDER BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') ASC

Month

SUM VALUE

April 2019

1052

February 2019

1


Example query and output without order by but group by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH)

Month

SUM VALUE

February 2019

1

April 2019

1052

We would like to enforce the ordering, so although this output is what we want, I don't think we can use this solution.


Example query and output with order by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH) 

ORDER BY floor(`testTable`.`timestamp` TO MONTH) ASC

Month

SUM VALUE


February 2019

1

2/1/2019 12:00 AM

April 2019

1052

4/1/2019 12:00 AM



--
Morrisa Brenner
Software Engineer
225 Franklin St, Boston, MA 02110
klaviyo.com
Klaviyo Logo
Reply | Threaded
Open this post in threaded view
|

Re: Flink SQL API: Extra columns added from order by

Rong Rong
Hi Morrisa,

Can you share more information regarding what type of function "formatDate" is and how did you configure the return type of that function?
For the question on the first query If the return type is String, then ASC on a string value should be on alphabetical ordering.

However on the third query, if the GROUP BY and ORDER BY are both operating on the same input to your UDF it shouldn't be part of the output columns. 
This looks like a bug to me. 

--
Rong

On Thu, Jul 11, 2019 at 11:45 AM Morrisa Brenner <[hidden email]> wrote:

Hi Flink folks,


We have a custom date formatting function that we use to format the output of columns containing dates. Ideally what we want is to format the output in the select statement but be able to order by the underlying datetime (so that and output with formatted dates "February 2019" and "April 2019" is guaranteed to have the rows sorted in time order rather than alphabetical order).


When I go to add the unformatted column to the order by, however, that gets appended as an extra column to the select statement during the query planning process within Calcite. (In the order by parsing, it's considering this a different column from the one in the select statement.) When the group by column is different in the same way but there's no order by column, the extra column isn't added. I've included a couple of simple examples below.


Is this the intended behavior of the query planner? Does anyone know of a way around this without needing to change the formatting so that it makes the output dates correctly sortable?


Thanks for your help!


Morrisa




Example query and output with order by using formatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') 

ORDER BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') ASC

Month

SUM VALUE

April 2019

1052

February 2019

1


Example query and output without order by but group by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH)

Month

SUM VALUE

February 2019

1

April 2019

1052

We would like to enforce the ordering, so although this output is what we want, I don't think we can use this solution.


Example query and output with order by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH) 

ORDER BY floor(`testTable`.`timestamp` TO MONTH) ASC

Month

SUM VALUE


February 2019

1

2/1/2019 12:00 AM

April 2019

1052

4/1/2019 12:00 AM



--
Morrisa Brenner
Software Engineer
225 Franklin St, Boston, MA 02110
klaviyo.com
Klaviyo Logo
Reply | Threaded
Open this post in threaded view
|

Re: Flink SQL API: Extra columns added from order by

Caizhi Weng
(Oops, I mistakenly sent my response only to Rong Rong. Sorry Rong...)

Hi Morrisa,

This is due to a bug in the old flink planner. The `createTable(new PlannerQueryOperation(relational.rel))` method in flink-table-planner -> `TableEnvImpl` -> `sqlQuery` should be `createTable(new PlannerQueryOperation(relational.project()))` as the last projection (if exists) should be performed. This bug is already fixed in the blink planner (flink-table-planner-blink). You can use blink planner and runner instead of the legacy planner to solve this problem.

Rong Rong <[hidden email]> 于2019年7月13日周六 上午7:08写道:
Hi Morrisa,

Can you share more information regarding what type of function "formatDate" is and how did you configure the return type of that function?
For the question on the first query If the return type is String, then ASC on a string value should be on alphabetical ordering.

However on the third query, if the GROUP BY and ORDER BY are both operating on the same input to your UDF it shouldn't be part of the output columns. 
This looks like a bug to me. 

--
Rong

On Thu, Jul 11, 2019 at 11:45 AM Morrisa Brenner <[hidden email]> wrote:

Hi Flink folks,


We have a custom date formatting function that we use to format the output of columns containing dates. Ideally what we want is to format the output in the select statement but be able to order by the underlying datetime (so that and output with formatted dates "February 2019" and "April 2019" is guaranteed to have the rows sorted in time order rather than alphabetical order).


When I go to add the unformatted column to the order by, however, that gets appended as an extra column to the select statement during the query planning process within Calcite. (In the order by parsing, it's considering this a different column from the one in the select statement.) When the group by column is different in the same way but there's no order by column, the extra column isn't added. I've included a couple of simple examples below.


Is this the intended behavior of the query planner? Does anyone know of a way around this without needing to change the formatting so that it makes the output dates correctly sortable?


Thanks for your help!


Morrisa




Example query and output with order by using formatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') 

ORDER BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') ASC

Month

SUM VALUE

April 2019

1052

February 2019

1


Example query and output without order by but group by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH)

Month

SUM VALUE

February 2019

1

April 2019

1052

We would like to enforce the ordering, so although this output is what we want, I don't think we can use this solution.


Example query and output with order by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH) 

ORDER BY floor(`testTable`.`timestamp` TO MONTH) ASC

Month

SUM VALUE


February 2019

1

2/1/2019 12:00 AM

April 2019

1052

4/1/2019 12:00 AM



--
Morrisa Brenner
Software Engineer
225 Franklin St, Boston, MA 02110
klaviyo.com
Klaviyo Logo
Reply | Threaded
Open this post in threaded view
|

Re: Flink SQL API: Extra columns added from order by

Morrisa Brenner
Hi Caizhi and Rong,

Thanks for the responses! It's good to know that this is a known bug - right now we're just using Flink 1.8 and will work around this, but we look forward to getting the fixes in the future!

All the best,
Morrisa


On Mon, Jul 15, 2019 at 2:25 AM Caizhi Weng <[hidden email]> wrote:
(Oops, I mistakenly sent my response only to Rong Rong. Sorry Rong...)

Hi Morrisa,

This is due to a bug in the old flink planner. The `createTable(new PlannerQueryOperation(relational.rel))` method in flink-table-planner -> `TableEnvImpl` -> `sqlQuery` should be `createTable(new PlannerQueryOperation(relational.project()))` as the last projection (if exists) should be performed. This bug is already fixed in the blink planner (flink-table-planner-blink). You can use blink planner and runner instead of the legacy planner to solve this problem.

Rong Rong <[hidden email]> 于2019年7月13日周六 上午7:08写道:
Hi Morrisa,

Can you share more information regarding what type of function "formatDate" is and how did you configure the return type of that function?
For the question on the first query If the return type is String, then ASC on a string value should be on alphabetical ordering.

However on the third query, if the GROUP BY and ORDER BY are both operating on the same input to your UDF it shouldn't be part of the output columns. 
This looks like a bug to me. 

--
Rong

On Thu, Jul 11, 2019 at 11:45 AM Morrisa Brenner <[hidden email]> wrote:

Hi Flink folks,


We have a custom date formatting function that we use to format the output of columns containing dates. Ideally what we want is to format the output in the select statement but be able to order by the underlying datetime (so that and output with formatted dates "February 2019" and "April 2019" is guaranteed to have the rows sorted in time order rather than alphabetical order).


When I go to add the unformatted column to the order by, however, that gets appended as an extra column to the select statement during the query planning process within Calcite. (In the order by parsing, it's considering this a different column from the one in the select statement.) When the group by column is different in the same way but there's no order by column, the extra column isn't added. I've included a couple of simple examples below.


Is this the intended behavior of the query planner? Does anyone know of a way around this without needing to change the formatting so that it makes the output dates correctly sortable?


Thanks for your help!


Morrisa




Example query and output with order by using formatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') 

ORDER BY formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH') ASC

Month

SUM VALUE

April 2019

1052

February 2019

1


Example query and output without order by but group by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH)

Month

SUM VALUE

February 2019

1

April 2019

1052

We would like to enforce the ordering, so although this output is what we want, I don't think we can use this solution.


Example query and output with order by using unformatted date:

SELECT 

formatDate(floor(`testTable`.`timestamp` TO MONTH), 'MONTH'), 

sum(`testTable`.`count`) 

FROM `testTable` 

GROUP BY floor(`testTable`.`timestamp` TO MONTH) 

ORDER BY floor(`testTable`.`timestamp` TO MONTH) ASC

Month

SUM VALUE


February 2019

1

2/1/2019 12:00 AM

April 2019

1052

4/1/2019 12:00 AM



--
Morrisa Brenner
Software Engineer
225 Franklin St, Boston, MA 02110
klaviyo.com
Klaviyo Logo


--
Morrisa Brenner
Software Engineer
225 Franklin St, Boston, MA 02110
klaviyo.com
Klaviyo Logo