State migration for sql job

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

State migration for sql job

aitozi
This post was updated on .
When use flink sql, we encounter a big problem to deal with sql state compatibility.
Think we have a group agg sql like


select sum(`a`) from source_t group by `uid`


 But if i want to add a new agg column to

select sum(`a`), max(`a`) from source_t group by `uid`


Then sql state will not be compatible. Is there any on-going work/thoughts to improve this situation?
Reply | Threaded
Open this post in threaded view
|

Re: State migration for sql job

JING ZHANG
Hi aitozi,
This is a popular demand that many users mentioned, which appears in user mail list for several times. 
Unfortunately, it is not supported by Flink SQL yet, maybe would be solved in the future. BTW, a few company try to solve the problem in some specified user cases on their internal Flink version[2].
Currently, you may try use `State Processor API`[1] as temporary solution. 
1. Do a savepoint 
2. Generates updated the savepoint based on State Processor API
3. Recover from the new savepoint.


Best regards,
JING ZHANG

aitozi <[hidden email]> 于2021年6月8日周二 下午1:54写道:
When use flink sql, we encounter a big problem to deal with sql state compatibility. Think we have a group agg sql like ```sql select sum(`a`) from source_t group by `uid` ``` But if i want to add a new agg column to ```sql select sum(`a`), max(`a`) from source_t group by `uid` ``` Then sql state will not be compatible. Is there any on-going work/thoughts to improve this situation?

Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
Reply | Threaded
Open this post in threaded view
|

Re: State migration for sql job

Kurt Young
What kind of expectation do you have after you add the "max(a)" aggregation:

a. Keep summing a and start to calculate max(a) after you added. In other words, max(a) won't take the history data into account.
b. First process all the historical data to get a result of max(a), and then start to compute sum(a) and max(a) together for the real-time data.

Best,
Kurt


On Tue, Jun 8, 2021 at 2:11 PM JING ZHANG <[hidden email]> wrote:
Hi aitozi,
This is a popular demand that many users mentioned, which appears in user mail list for several times. 
Unfortunately, it is not supported by Flink SQL yet, maybe would be solved in the future. BTW, a few company try to solve the problem in some specified user cases on their internal Flink version[2].
Currently, you may try use `State Processor API`[1] as temporary solution. 
1. Do a savepoint 
2. Generates updated the savepoint based on State Processor API
3. Recover from the new savepoint.


Best regards,
JING ZHANG

aitozi <[hidden email]> 于2021年6月8日周二 下午1:54写道:
When use flink sql, we encounter a big problem to deal with sql state compatibility. Think we have a group agg sql like ```sql select sum(`a`) from source_t group by `uid` ``` But if i want to add a new agg column to ```sql select sum(`a`), max(`a`) from source_t group by `uid` ``` Then sql state will not be compatible. Is there any on-going work/thoughts to improve this situation?

Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.
Reply | Threaded
Open this post in threaded view
|

Re: State migration for sql job

aitozi
Thanks for JING & Kurt's reply. I think we prefer to choose the option (a)
that will not take  the history data into account.

IMO, if we want to process all the historical data, we have to store the
original data, which may be a big overhead to backend. But if we just
aggregate after the new added function, may just need a data format
transfer. Besides, the business logic we met only need the new aggFunction
accumulate with new data.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Reply | Threaded
Open this post in threaded view
|

Re: State migration for sql job

Yuval Itzchakov
As my company is also a heavy user of Flink SQL, the state migration story is very important to us.

I as well believe that adding new fields should start to accumulate state from the point in time of the change forward.

Is anyone actively working on this? Is there anyway to get involved?

On Tue, Jun 8, 2021, 17:33 aitozi <[hidden email]> wrote:
Thanks for JING & Kurt's reply. I think we prefer to choose the option (a)
that will not take  the history data into account.

IMO, if we want to process all the historical data, we have to store the
original data, which may be a big overhead to backend. But if we just
aggregate after the new added function, may just need a data format
transfer. Besides, the business logic we met only need the new aggFunction
accumulate with new data.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/