Init RocksDB state backend during startup

Peter Zende
Hi,

We use RocksDB with FsStateBackend (HDFS) to store state used by the mapWithState operator. Is it possible to initialize / populate this state during the streaming application startup?

Our intention is to reprocess the historical data from HDFS in a batch job and save the latest state of the records onto HDFS. Thus when we restart the streaming job we can just build up or load the most recent view of this store.

Many thanks,
Peter

Re: Init RocksDB state backend during startup

xiatao123
Also would like to know how to do this if it is possible.

On Fri, May 4, 2018 at 9:31 AM, Peter Zende <[hidden email]> wrote:
Hi,

We use RocksDB with FsStateBackend (HDFS) to store state used by the mapWithState operator. Is it possible to initialize / populate this state during the streaming application startup?

Our intention is to reprocess the historical data from HDFS in a batch job and save the latest state of the records onto HDFS. Thus when we restart the streaming job we can just build up or load the most recent view of this store.

Many thanks,
Peter

Re: Init RocksDB state backend during startup

Fabian Hueske-2
Hi Peter,

State initialization with historic data is a use case that's coming up more and more.
Unfortunately, there's no good solution for this yet, just a couple of workarounds that require careful design and do not work for all cases.
There was a talk about exactly this problem and some ideas for addressing it at Flink Forward a month ago [1]. The slides and video of the talk are available online [2].

Your idea of initializing keyed state during startup (in the open() method) doesn't work.
Keyed state is always scoped to the key of the record that is currently being processed.
Since no records are being processed during initialization, there is no current key, and one would need to set the key manually for each piece of state to initialize.
The challenge here is that the keys are partitioned / sharded across the parallel instances. So, one would need to know which parallel instance is responsible for which key. This is not trivial.
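
To make the sharding problem concrete, here is a minimal, self-contained sketch in plain Java (no Flink dependency). It assumes a key-group style scheme: a key is hashed into one of maxParallelism key groups, and contiguous ranges of key groups are assigned to the parallel instances. The class name KeyRoutingSketch, its method names, and the scramble() hash are all hypothetical stand-ins for illustration and are not Flink's actual hash function, so the concrete instance numbers will differ from what a real job computes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: how keys might be sharded over parallel instances. */
final class KeyRoutingSketch {

    // Hypothetical stand-in for the extra hashing a runtime applies on top
    // of hashCode(). This is NOT the hash function Flink uses.
    static int scramble(int h) {
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h & Integer.MAX_VALUE; // clear the sign bit so the result is non-negative
    }

    // Maps a key to a key group in the range [0, maxParallelism).
    static int keyGroup(Object key, int maxParallelism) {
        return scramble(key.hashCode()) % maxParallelism;
    }

    // Maps a key group to the index of the parallel instance that owns it.
    // Contiguous ranges of key groups end up on the same instance.
    static int operatorIndex(int keyGroup, int parallelism, int maxParallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Convenience: key -> owning parallel instance.
    static int instanceFor(Object key, int parallelism, int maxParallelism) {
        return operatorIndex(keyGroup(key, maxParallelism), parallelism, maxParallelism);
    }

    // Groups keys by the instance that would have to initialize their state.
    static Map<Integer, List<String>> partition(List<String> keys,
                                                int parallelism,
                                                int maxParallelism) {
        Map<Integer, List<String>> byInstance = new TreeMap<>();
        for (String key : keys) {
            int instance = instanceFor(key, parallelism, maxParallelism);
            byInstance.computeIfAbsent(instance, i -> new ArrayList<>()).add(key);
        }
        return byInstance;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("user-1", "user-2", "user-3", "user-4", "user-5");
        // Each instance may only initialize the keys listed under its own index.
        System.out.println(partition(keys, 4, 128));
    }
}
```

The point of the sketch is that the owner of a key depends on the hash, the parallelism, and the maximum parallelism together. A pre-computed state dump would therefore have to be split with exactly the same function and settings that the streaming job uses at runtime.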

Best,
Fabian

2018-05-04 19:47 GMT+02:00 Tao Xia <[hidden email]>:
Also would like to know how to do this if it is possible.
