Can I use a lot of keyed states, or should I use one big keyed state?


shashank734
Hello,

I have to compute results based on a lot of historical data: parameters such as total transactions in the last month, last day, last hour, etc., grouped by email ID, IP, mobile number, name, address, zip code, and so on.

So my question is: is it the right approach to create keyed state by email, mobile, zip code, etc.? Or should I create one big mapped state (BS) and then process that BS, maybe in a process function, or by applying some loop-and-filter logic in a window or process function?

My main worry is that I will end up with millions of states, because there can be millions of unique emails, phone numbers, or zip codes if I create keyed state by email, phone, etc.

Am I right? Does this impact performance, or is it the wrong approach? Which approach would you suggest for this use case?


--
Thanks Regards

SHASHANK AGARWAL
 ---  Trying to mobilize the things....

Re: Can I use a lot of keyed states, or should I use one big keyed state?

Stephan Ewen
Each keyed state in Flink is backed by its own hash table (heap backend) or its own column family (RocksDB backend). Having too many of those is not memory efficient.

Having fewer states is better, if you can adapt your schema that way.

I would also look into "MapState", which is an efficient way to have "sub keys" under a keyed state.
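
To make the "sub keys" idea concrete, here is a minimal sketch in plain Java. It is a toy in-memory model and not Flink's API: the class `BucketedCounts`, its methods, and the hourly bucketing are all assumptions made for illustration. The outer map key stands in for the stream key (an email), and the inner map stands in for what MapState would hold: one count per hourly bucket, so that "last hour", "last day" and "last month" totals can be read from a single state instead of three.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy in-memory model (plain Java, NOT Flink's API) of one keyed state with sub keys.
 * Outer key: the stream key, e.g. an email. Sub key: start time of an hourly bucket.
 */
class BucketedCounts {
    static final long HOUR_MS = 60L * 60L * 1000L;

    // One table for the whole state; each stream key owns a small map of sub keys.
    private final Map<String, Map<Long, Long>> table = new HashMap<>();

    /** Record one transaction for the given key at the given timestamp. */
    void add(String key, long timestampMs) {
        long bucketStart = Math.floorDiv(timestampMs, HOUR_MS) * HOUR_MS;
        table.computeIfAbsent(key, k -> new HashMap<>()).merge(bucketStart, 1L, Long::sum);
    }

    /**
     * Sum the hourly buckets that overlap the window (nowMs - windowMs, nowMs].
     * Buckets are one hour wide, so the result is approximate at the older edge.
     */
    long countInLast(String key, long nowMs, long windowMs) {
        long total = 0;
        Map<Long, Long> buckets = table.getOrDefault(key, Collections.<Long, Long>emptyMap());
        for (Map.Entry<Long, Long> e : buckets.entrySet()) {
            long bucketStart = e.getKey();
            if (bucketStart + HOUR_MS > nowMs - windowMs && bucketStart <= nowMs) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        BucketedCounts counts = new BucketedCounts();
        long now = 100 * HOUR_MS;
        counts.add("a@example.com", now);
        counts.add("a@example.com", 50 * HOUR_MS);
        System.out.println("last hour:    " + counts.countInLast("a@example.com", now, HOUR_MS));
        System.out.println("last 30 days: " + counts.countInLast("a@example.com", now, 720 * HOUR_MS));
    }
}
```

In an actual job the inner map would presumably live in a `MapState<Long, Long>` used inside a keyed function, and old buckets would have to be cleaned up (for example with timers or state TTL); neither is modelled here.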

Stephan


On Mon, Jul 31, 2017 at 6:01 PM, shashank agarwal <[hidden email]> wrote:

Re: Can I use a lot of keyed states, or should I use one big keyed state?

shashank734
OK, let me check my understanding with an example:

If I create a keyed state named "total count by email" for the key (project id + email), then it will create a single hash table or column family "total count by email", and all the unique email IDs will be rows of that single hash table or column family, so I can store millions of unique email IDs in it.

Does that mean it will create only a single state object for all unique email IDs?




On Tue, Aug 1, 2017 at 1:53 AM, Stephan Ewen <[hidden email]> wrote:

Re: Can I use a lot of keyed states, or should I use one big keyed state?

shashank734
If I create a keyed state ("count by email id") and the keyed stream has 10 unique email IDs:

Will it create 1 column family or hash table?

Or will it create 10 column families or hash tables?

Can I have millions of unique email IDs in that keyed state?



On Tue, Aug 1, 2017 at 2:59 AM, shashank agarwal <[hidden email]> wrote:

Re: Can I use a lot of keyed states, or should I use one big keyed state?

Aljoscha Krettek
Hi,

If you have one keyed state, say "count by email id", and many different keys, you will only have one column family in RocksDB (or one hash table). Actually, a lot of users have hundreds of millions of different keys for some states.
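
As a rough illustration of that layout, the following plain-Java sketch models a backend that keeps one table per state name and one row per key. It is a toy model under stated assumptions, not Flink's API or its real storage code: `ToyKeyedBackend` and its methods are hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model (plain Java, NOT Flink's API) of the layout described above:
 * one table per registered state name, one row per key inside that table.
 */
class ToyKeyedBackend {
    // state name -> (key -> value); the outer map grows with state names only.
    private final Map<String, Map<String, Long>> tables = new HashMap<>();

    /** Increment the counter stored under (stateName, key) and return the new value. */
    long increment(String stateName, String key) {
        return tables.computeIfAbsent(stateName, n -> new HashMap<>())
                     .merge(key, 1L, Long::sum);
    }

    /** Number of tables, i.e. number of distinct state names seen so far. */
    int tableCount() {
        return tables.size();
    }

    /** Number of rows (distinct keys) stored under one state name. */
    int keyCount(String stateName) {
        Map<String, Long> t = tables.get(stateName);
        return t == null ? 0 : t.size();
    }

    public static void main(String[] args) {
        ToyKeyedBackend backend = new ToyKeyedBackend();
        for (int i = 0; i < 100_000; i++) {
            backend.increment("count by email id", "user" + i + "@example.com");
        }
        System.out.println(backend.tableCount() + " table(s), "
                + backend.keyCount("count by email id") + " keys");
    }
}
```

The point of the model is only that adding keys adds rows, while adding differently named states adds tables; the second is what costs memory per state.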

Best,
Aljoscha 
On 2. Aug 2017, at 14:59, shashank agarwal <[hidden email]> wrote:

Re: Can I use a lot of keyed states, or should I use one big keyed state?

shashank734
Thanks Aljoscha and Stephan for clearing the doubt.




On Wed, Aug 9, 2017 at 7:37 PM, Aljoscha Krettek <[hidden email]> wrote: