Multiple MapState vs single nested MapState in stateful Operator


Gagan Agrawal
Hi,
I have a use case where 4 streams are merged (union), grouped on a common key (keyBy), and passed to a custom KeyedProcessFunction. I need to keep state (RocksDB backend) for all 4 streams in that KeyedProcessFunction, with the events from each stream stored as a map. I see 2 options:

1. Create a separate MapStateDescriptor for each stream and store its events separately.
2. Create a single MapStateDescriptor with only 4 keys (one per stream type), where each value is a Map that holds the events from the respective stream.

From a performance perspective, is there any difference between these approaches? Will keeping 4 different MapStates cause 4 lookups against the RocksDB backend when they are accessed? Or are all of these MapStates stored internally in RocksDB in a single row for the respective key (as per the keyedStream), so that they are all fetched in a single call before the operator's processElement is called? If each MapStateDescriptor requires a separate RocksDB lookup, then I think keeping everything in a single MapStateDescriptor would be more efficient, since it would minimize RocksDB calls. Please advise.

Gagan

Re: Multiple MapState vs single nested MapState in stateful Operator

Congxian Qiu
Hi, Gagan Agrawal

I prefer the first option, for the following reason.

With the RocksDB state backend, we serialize the key, namespace, and user-key into one byte array (key-bytes) and the user-value into another (value-bytes), then insert the key-bytes/value-bytes pair into RocksDB. When retrieving from RocksDB, we can use get (for a single key/value) or an iterator (for a key range).

If we store the four maps in a single MapState, each value-bytes is a whole Map, so we need to deserialize the entire Map whenever we want to retrieve a single user-value.
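
The layout described above can be sketched with a toy model. This is not Flink or RocksDB code: a TreeMap stands in for the sorted store, strings stand in for serialized bytes, and the class name, method names, and NUL separator are all hypothetical choices made for illustration. It only shows how a composite key of (key, namespace, user-key) allows both a point lookup and a prefix range scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a sorted key/value store; an illustration, not Flink or RocksDB internals. */
class MapStateLayoutSketch {
    // Hypothetical separator between the parts of the composite key.
    private static final char SEP = '\0';

    // Sorted map standing in for the store; keys model the serialized key-bytes.
    private final TreeMap<String, byte[]> store = new TreeMap<>();

    private static String compositeKey(String key, String namespace, String userKey) {
        return key + SEP + namespace + SEP + userKey;
    }

    /** One store entry per (key, namespace, user-key); the value holds only that entry. */
    void put(String key, String namespace, String userKey, String value) {
        store.put(compositeKey(key, namespace, userKey), value.getBytes(StandardCharsets.UTF_8));
    }

    /** Point lookup: returns the value-bytes of exactly one map entry, or null if absent. */
    byte[] get(String key, String namespace, String userKey) {
        return store.get(compositeKey(key, namespace, userKey));
    }

    /** Range scan: every entry whose composite key starts with key + namespace. */
    Map<String, byte[]> scan(String key, String namespace) {
        String prefix = key + SEP + namespace + SEP;
        return store.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        MapStateLayoutSketch s = new MapStateLayoutSketch();
        s.put("user-1", "ns", "e1", "payload-1");
        s.put("user-1", "ns", "e2", "payload-2");
        s.put("user-2", "ns", "e1", "other");
        System.out.println(new String(s.get("user-1", "ns", "e2"), StandardCharsets.UTF_8));
        System.out.println(s.scan("user-1", "ns").size());
    }
}
```

In this model, reading one event touches only that event's value, while iterating a whole map is a scan over neighbouring entries.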


In reply to Gagan Agrawal <[hidden email]> (Thu, Jan 10, 2019 at 10:38 AM)

--
Best,
Congxian

Re: Multiple MapState vs single nested MapState in stateful Operator

Kostas Kloudas
Hi Gagan,

I agree with Congxian!
In MapState, accessing the value associated with a key in the map deserializes the whole value (and a put() serializes the whole value).
Given this, it is more efficient to have many keys, each with a small value, than fewer keys, each with a huge value.
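
A rough way to see the difference is to count how much data has to be deserialized to read a single event under each layout. The sketch below is a simplification under stated assumptions: it counts characters as a stand-in for serialized bytes, the class and method names are hypothetical, and real costs depend on the serializers in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy cost model: characters stand in for serialized bytes. Not Flink code. */
class NestedVsFlatCost {
    /** Many small values: reading one event deserializes only that event's value. */
    static long flatReadCost(Map<String, String> events, String eventId) {
        return events.get(eventId).length();
    }

    /** One huge value: the whole inner map is a single value, so every entry is deserialized. */
    static long nestedReadCost(Map<String, String> events) {
        long total = 0;
        for (Map.Entry<String, String> e : events.entrySet()) {
            total += e.getKey().length() + e.getValue().length();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> events = new LinkedHashMap<>();
        events.put("e1", "aaaa");
        events.put("e2", "bbbb");
        events.put("e3", "cccc");
        System.out.println("flat read of e2:   " + flatReadCost(events, "e2"));
        System.out.println("nested read of e2: " + nestedReadCost(events));
    }
}
```

The flat cost stays constant as the number of events grows, while the nested cost grows with the size of the whole inner map.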

Cheers,
Kostas


In reply to Congxian Qiu <[hidden email]> (Thu, Jan 10, 2019 at 12:34 PM)

Re: Multiple MapState vs single nested MapState in stateful Operator

Gagan Agrawal
This makes perfect sense to me. Thanks Congxian and Kostas for your inputs.

Gagan

In reply to Kostas Kloudas <[hidden email]> (Thu, Jan 10, 2019 at 6:03 PM)