State size vs. number of keys: performance

KristoffSC
Hi,
I would like to ask which has the larger memory footprint, and which is
likely to be more efficient: fewer keys with a bigger keyed state each, or
many keys with a smaller keyed state each?

For this use case I'm using the RocksDB state backend and the state TTL is,
well, infinite, so I'm keeping the state in Flink forever.

The use case:
I have a stream of messages that I have to process in a custom way.
I can take one of two approaches:

1. Use a keyBy that gives me some number of distinct keys, where the state
for each key will be significant in size. It will be a MapState in this case.
This keyBy still lets me spread the work across operator instances.

2. Use a different keyBy that gives me a huge number of distinct keys, where
the state for each key will be very small. It will be a ValueState in this
case.

To sum up:
a "reasonable" number of keys with a very big keyed state each vs. a huge
number of keys with a very small state each.

What are the pros and cons of each?
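
To make the two options concrete, here is a minimal sketch in plain Python (a model, not Flink code). The message fields `device`, `sensor`, and `reading` are hypothetical and only stand in for whatever the real keys would be. A dict of dicts plays the role of per-key MapState, and a flat dict keyed by a tuple plays the role of per-key ValueState.

```python
from collections import defaultdict

# Hypothetical messages: (device, sensor, reading). Names are illustrative only.
messages = [
    ("device-1", "temp", 21.5),
    ("device-1", "humidity", 40.0),
    ("device-2", "temp", 19.0),
    ("device-1", "temp", 22.0),  # overwrites the earlier "temp" reading
]

# Approach 1: key by device; each key owns a map (MapState-like).
by_device = defaultdict(dict)
# Approach 2: key by (device, sensor); each key owns one value (ValueState-like).
by_device_and_sensor = {}

for device, sensor, reading in messages:
    by_device[device][sensor] = reading
    by_device_and_sensor[(device, sensor)] = reading

entries_in_maps = sum(len(m) for m in by_device.values())
print(len(by_device), "keys holding", entries_in_maps, "map entries in total")
print(len(by_device_and_sensor), "keys holding one value each")
```

Both layouts hold the same readings; only the number of keys and the amount of state behind each key differ.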




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: State size vs. number of keys: performance

Congxian Qiu
Hi,
Here is some information from my side:
1. RocksDB performance depends mainly on (de)serialization and on disk reads/writes.
2. MapState only needs to (de)serialize a single map key/map value when reading or writing state, whereas ValueState needs to (de)serialize the whole state on every read or write.
3. Disk read/write cost is somewhat related to the whole state size.
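
A rough way to see point 2 is to count the bytes that one update has to serialize. The sketch below is plain Python, not Flink: `pickle` only stands in for a Flink serializer, and the 1,000-entry map is a made-up example, so the absolute numbers mean nothing. Only the relative sizes matter.

```python
import pickle

def serialized_size(obj):
    """Bytes produced by serializing obj (pickle stands in for a Flink serializer)."""
    return len(pickle.dumps(obj))

# Hypothetical per-key data: 1,000 small entries.
big_map = {f"entry-{i}": i for i in range(1000)}

# ValueState holding the whole map: every update rewrites the full value.
value_state_whole_map = serialized_size(big_map)

# MapState: an update touches only one map key and one map value.
map_state_single_entry = serialized_size("entry-7") + serialized_size(7)

# ValueState holding a single small value (the "many narrow keys" approach).
value_state_single_value = serialized_size(7)

print("ValueState, whole map:   ", value_state_whole_map, "bytes")
print("MapState, one entry:     ", map_state_single_entry, "bytes")
print("ValueState, single value:", value_state_single_value, "bytes")
```

In this model the per-update cost is only a problem when a ValueState holds a large collection. A ValueState with one small value costs about as much per access as a single MapState entry.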

Best,
Congxian


(In reply to the message from KristoffSC <[hidden email]>, Wed, Apr 8, 2020, 2:41 AM)

Re: State size vs. number of keys: performance

KristoffSC
Thanks, Congxian Qiu.
I'm aware of your second point. In the ValueState I will keep a String or a
very simple POJO, without any collections inside.

I didn't get your third point; could you clarify it, please?
"disk read/write is somewhat about the whole state size"

What I would keep in the ValueState is exactly what would otherwise be kept in
a single MapState entry. Depending on which key I choose, my state can be
"broader", in which case I will use MapState, or very narrow, in which case I
can use a ValueState that holds only one entry.

This is the essence of my question: what are the trade-offs here?





Re: State size vs. number of keys: performance

Congxian Qiu
Hi,

In my last email I only meant that the overall state size (together with the access pattern, which I assume is the same for both kinds of state) affects the final performance; this has to do with RocksDB's architecture. If MapState and ValueState end up with about the same state size on each subtask, then there is no difference in this respect.
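
As a rough illustration of "about the same state size on each subtask", the sketch below routes the same entries to subtasks under both keyings and sums the stored bytes. It is a plain-Python model under stated assumptions, not Flink code: `zlib.crc32` stands in for Flink's key hashing, `MAX_PARALLELISM` and `PARALLELISM` are assumed values, and the device/sensor data set is hypothetical.

```python
import zlib

MAX_PARALLELISM = 128  # number of key groups (assumed value)
PARALLELISM = 4        # number of subtasks (assumed value)

def subtask_for(key):
    """Map a key to a subtask via key groups; crc32 stands in for Flink's hash."""
    key_group = zlib.crc32(repr(key).encode()) % MAX_PARALLELISM
    return key_group * PARALLELISM // MAX_PARALLELISM

def bytes_per_subtask(entries, key_of):
    """Sum stored payload bytes per subtask, routing each entry by key_of(...)."""
    totals = [0] * PARALLELISM
    for device, sensor, payload in entries:
        totals[subtask_for(key_of(device, sensor))] += len(payload)
    return totals

# Hypothetical data set: 20 devices x 500 sensors, 16-byte payload each.
entries = [(f"device-{d}", f"sensor-{s}", b"x" * 16)
           for d in range(20) for s in range(500)]

coarse = bytes_per_subtask(entries, lambda d, s: d)     # MapState per device
fine = bytes_per_subtask(entries, lambda d, s: (d, s))  # ValueState per pair

print("keyBy(device):        ", coarse)
print("keyBy(device, sensor):", fine)
```

The total is identical in both cases by construction. What can differ is how evenly it spreads: with only a few large keys, one subtask may end up holding noticeably more state than another.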

Best,
Congxian


(In reply to the message from KristoffSC <[hidden email]>, Wed, Apr 8, 2020, 3:36 PM)