Use State query to dump state into datalake

Use State query to dump state into datalake

Lian Jiang
Hi,

I am interested in dumping Flink state from RocksDB into a data lake using Queryable State (https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/queryable_state/). My map state could hold 200 million key-value pairs with a total size of roughly 150 GB. My batch job, scheduled with Airflow, would have one task that uses Queryable State to dump the Flink state into the data lake in Parquet format so that other Spark tasks can consume it.

Is there any scalability concern with using Queryable State this way? I'd appreciate any insight. Thanks!
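For concreteness, here is a minimal sketch of the kind of client-side query this plan would involve, assuming the Flink 1.13-era Queryable State client. The host, port, job ID, state name, and types are all hypothetical placeholders:

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    import org.apache.flink.api.common.JobID;
    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.queryablestate.client.QueryableStateClient;

    public class QueryableStateDump {
        public static void main(String[] args) throws Exception {
            // Host and proxy port of a TaskManager running the queryable
            // state proxy (9069 is the default proxy port).
            QueryableStateClient client =
                    new QueryableStateClient("taskmanager-host", 9069);

            // The running job's ID (placeholder value), obtained e.g. from
            // the REST API or the web UI.
            JobID jobId = JobID.fromHexString("0123456789abcdef0123456789abcdef");

            // Must match the descriptor under which the state was made queryable.
            MapStateDescriptor<String, Long> descriptor =
                    new MapStateDescriptor<>("my-map-state", Types.STRING, Types.LONG);

            // Queryable State serves the state of ONE stream key per request,
            // so dumping everything means issuing a request per known key.
            CompletableFuture<MapState<String, Long>> future = client.getKvState(
                    jobId, "my-map-state", "some-stream-key", Types.STRING, descriptor);

            for (Map.Entry<String, Long> entry : future.get().entries()) {
                // hand each entry to a Parquet writer here
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }

            client.shutdownAndWait();
        }
    }

Note that getKvState returns the state for a single stream key per request, so dumping the full state would mean enumerating every key externally.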
Re: Use State query to dump state into datalake

David Anderson
I think you'd be better off using the State Processor API [1] instead. It has cleaner semantics -- you'll be seeing a self-consistent snapshot of all the state -- and it's also much more performant.

Note also that the Queryable State API is "approaching end of life" [2]. The long-term objective is to replace this with something more useful.
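To sketch what that might look like for your map state: the operator uid, state name, savepoint path, and types below are hypothetical, and this assumes the DataSet-based State Processor API as of Flink 1.13:

    import java.util.Map;

    import org.apache.flink.api.common.state.MapState;
    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.runtime.state.memory.MemoryStateBackend;
    import org.apache.flink.state.api.ExistingSavepoint;
    import org.apache.flink.state.api.Savepoint;
    import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
    import org.apache.flink.util.Collector;

    public class SavepointToParquet {

        // Emits one (stream key, map key, map value) triple per map entry.
        static class MapStateReader
                extends KeyedStateReaderFunction<String, Tuple3<String, String, Long>> {

            private transient MapState<String, Long> mapState;

            @Override
            public void open(Configuration parameters) {
                // Must match the descriptor registered by the streaming job.
                mapState = getRuntimeContext().getMapState(
                        new MapStateDescriptor<>("my-map-state", Types.STRING, Types.LONG));
            }

            @Override
            public void readKey(String key, Context ctx,
                                Collector<Tuple3<String, String, Long>> out) throws Exception {
                for (Map.Entry<String, Long> e : mapState.entries()) {
                    out.collect(Tuple3.of(key, e.getKey(), e.getValue()));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Load a self-consistent snapshot of the job's state; for 150 GB
            // you would likely swap MemoryStateBackend for the RocksDB backend.
            ExistingSavepoint savepoint =
                    Savepoint.load(env, "hdfs://path/to/savepoint", new MemoryStateBackend());

            // "my-operator-uid" must be the uid() of the operator owning the state.
            DataSet<Tuple3<String, String, Long>> rows =
                    savepoint.readKeyedState("my-operator-uid", new MapStateReader());

            // In the real job you would attach a Parquet output format here;
            // print() just demonstrates that the rows come out as a DataSet.
            rows.print();
        }
    }

Because the reader runs as a regular parallel batch job over a savepoint, the 150 GB of state can be read with whatever parallelism the cluster supports, rather than being pulled one key at a time through a query client.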

Regards,
David
