Use State query to dump state into datalake

Use State query to dump state into datalake

Lian Jiang
Hi,

I am interested in dumping Flink state from RocksDB into a data lake using Queryable State (https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/queryable_state/). My map state could hold 200 million key-value pairs with a total size of roughly 150 GB. My batch job, scheduled with Airflow, would have one task that uses Queryable State to dump the Flink state into the data lake in Parquet format so that other Spark tasks can consume it.

Is there any scalability concern with using Queryable State this way? I'd appreciate any insight. Thanks!
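For concreteness, here is a minimal sketch of the kind of client-side query this plan would involve, assuming the Flink 1.13-era Queryable State client. The host, port, job ID, state name, and types are all hypothetical placeholders:

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    import org.apache.flink.api.common.JobID;
    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.queryablestate.client.QueryableStateClient;

    public class QueryableStateDump {
        public static void main(String[] args) throws Exception {
            // Host and proxy port of a TaskManager running the queryable
            // state proxy (9069 is the default proxy port).
            QueryableStateClient client =
                    new QueryableStateClient("taskmanager-host", 9069);

            // The running job's ID (placeholder value), obtained e.g. from
            // the REST API or the web UI.
            JobID jobId = JobID.fromHexString("0123456789abcdef0123456789abcdef");

            // Must match the descriptor under which the state was made queryable.
            MapStateDescriptor<String, Long> descriptor =
                    new MapStateDescriptor<>("my-map-state", Types.STRING, Types.LONG);

            // Queryable State serves the state of ONE stream key per request,
            // so dumping everything means issuing a request per known key.
            CompletableFuture<MapState<String, Long>> future = client.getKvState(
                    jobId, "my-map-state", "some-stream-key", Types.STRING, descriptor);

            for (Map.Entry<String, Long> entry : future.get().entries()) {
                // hand each entry to a Parquet writer here
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }

            client.shutdownAndWait();
        }
    }

Note that getKvState returns the state for a single stream key per request, so dumping the full state would mean enumerating every key externally.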
Re: Use State query to dump state into datalake

David Anderson
I think you'd be better off using the State Processor API [1] instead. It has cleaner semantics -- you'll be seeing a self-consistent snapshot of all the state -- and it's also much more performant.

Note also that the Queryable State API is "approaching end of life" [2]. The long-term objective is to replace this with something more useful.
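To sketch what that might look like for your map state: the operator uid, state name, savepoint path, and types below are hypothetical, and this assumes the DataSet-based State Processor API as of Flink 1.13:

    import java.util.Map;

    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.state.api.ExistingSavepoint;
    import org.apache.flink.state.api.Savepoint;
    import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
    import org.apache.flink.util.Collector;

    public class SavepointToParquet {

        // Emits one (stream key, map key, map value) triple per map entry.
        static class MapStateReader
                extends KeyedStateReaderFunction<String, Tuple3<String, String, Long>> {

            private transient MapState<String, Long> mapState;

            @Override
            public void open(Configuration parameters) {
                // Must match the descriptor registered by the streaming job.
                mapState = getRuntimeContext().getMapState(
                        new MapStateDescriptor<>("my-map-state", Types.STRING, Types.LONG));
            }

            @Override
            public void readKey(String key, Context ctx,
                                Collector<Tuple3<String, String, Long>> out) throws Exception {
                for (Map.Entry<String, Long> e : mapState.entries()) {
                    out.collect(Tuple3.of(key, e.getKey(), e.getValue()));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Load a self-consistent snapshot of the job's state; for 150 GB
            // you would likely swap MemoryStateBackend for the RocksDB backend.
            ExistingSavepoint savepoint =
                    Savepoint.load(env, "hdfs://path/to/savepoint", new MemoryStateBackend());

            // "my-operator-uid" must be the uid() of the operator owning the state.
            DataSet<Tuple3<String, String, Long>> rows =
                    savepoint.readKeyedState("my-operator-uid", new MapStateReader());

            // In the real job you would attach a Parquet output format here;
            // print() just demonstrates that the rows come out as a DataSet.
            rows.print();
        }
    }

Because the reader runs as a regular parallel batch job over a savepoint, the 150 GB of state can be read with whatever parallelism the cluster supports, rather than being pulled one key at a time through a query client.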

Regards,
David
