(DEPRECATED) Apache Flink User Mailing List archive.

Can I use a lot of keyed states, or should I use one big keyed state?

shashank734

Can I use a lot of keyed states, or should I use one big keyed state?

Hello,

I have to compute results based on a lot of historical data: parameters such as total transactions in the last month, last day, last hour, etc., by email ID, IP, mobile number, name, address, zip code, etc.

So my question is: is it the right approach to create keyed state by email, mobile, zip code, etc., or should I create one big mapped state (BS) and then process that BS, maybe in a process function or by applying some loop-and-filter logic in a window or process function?

My main worry is that I will end up with millions of states, because there can be millions of unique emails, phone numbers, or zip codes if I create keyed state by email, phone, etc.

Am I right? Does this impact performance, or is it the wrong approach? Which approach would you suggest for this use case?

Thanks & Regards

SHASHANK AGARWAL

--- Trying to mobilize the things....

Stephan Ewen

Re: Can I use a lot of keyed states, or should I use one big keyed state?

Each keyed state in Flink is a hashtable or a column family in RocksDB. Having too many of those is not memory efficient.

Having fewer states is better, if you can adapt your schema that way.

I would also look into "MapState", which is an efficient way to have "sub keys" under a keyed state.

Stephan
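
To make the "sub keys" idea concrete, here is a toy model in plain Java. It is not Flink code: ToyStateBackend, MapStateSketch, and their methods are hypothetical names invented for illustration, and the nested HashMap only stands in for the per-state hash table or column family described above. The sketch assumes that each registered state name owns exactly one table, and it compares registering one state per metric with registering a single map-style state whose sub keys are the metrics.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a keyed-state backend (NOT the Flink API).
// Assumption: every registered state name owns exactly one table.
class ToyStateBackend {
    // state name -> (stream key -> (sub key -> count))
    private final Map<String, Map<String, Map<String, Long>>> tables = new HashMap<>();

    // Increment the counter stored under (stateName, key, subKey).
    void increment(String stateName, String key, String subKey) {
        tables.computeIfAbsent(stateName, n -> new HashMap<>())
              .computeIfAbsent(key, k -> new HashMap<>())
              .merge(subKey, 1L, Long::sum);
    }

    // Read a counter; missing entries count as zero.
    long get(String stateName, String key, String subKey) {
        Map<String, Map<String, Long>> table = tables.get(stateName);
        if (table == null) {
            return 0L;
        }
        Map<String, Long> entry = table.get(key);
        return entry == null ? 0L : entry.getOrDefault(subKey, 0L);
    }

    // Number of tables, i.e. number of registered states.
    int tableCount() {
        return tables.size();
    }
}

class MapStateSketch {
    // Layout A: one registered state per metric, so one table per metric.
    static ToyStateBackend separateStates() {
        ToyStateBackend backend = new ToyStateBackend();
        backend.increment("count_last_hour", "a@example.com", "value");
        backend.increment("count_last_day", "a@example.com", "value");
        backend.increment("count_last_month", "a@example.com", "value");
        return backend;
    }

    // Layout B: a single map-style state; the metrics become sub keys.
    static ToyStateBackend singleMapState() {
        ToyStateBackend backend = new ToyStateBackend();
        backend.increment("counts", "a@example.com", "last_hour");
        backend.increment("counts", "a@example.com", "last_day");
        backend.increment("counts", "a@example.com", "last_month");
        return backend;
    }

    public static void main(String[] args) {
        System.out.println("separate states: " + separateStates().tableCount() + " tables");
        System.out.println("single map state: " + singleMapState().tableCount() + " table");
    }
}
```

In a real job, the second layout would correspond to Flink's MapState, declared through a MapStateDescriptor. The metric names used as sub keys here are assumptions taken from the time ranges in the original question.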

shashank734

Re: Can I use a lot of keyed states, or should I use one big keyed state?

OK, taking that as correct, here is an example:

If I create a keyed state named "total count by email" for the key (project ID + email), then it will create a single hash table or column family "total count by email", and all the unique email IDs will be rows of that single hash table or column family, so I can store millions of unique email IDs in it.

Does that mean it will create only a single state object for all unique email IDs?

shashank734

Re: Can I use a lot of keyed states, or should I use one big keyed state?

If I create a keyed state ("count by email id") and the keyed stream has 10 unique email IDs:

Will it create one column family or hash table?

Or will it create 10 column families or hash tables?

Can I have millions of unique email IDs in that keyed state?

Aljoscha Krettek

Re: Can I use a lot of keyed states, or should I use one big keyed state?

Hi,

If you have one keyed state, say "count by email id", and many different keys, you will only have one column family in RocksDB (or one hash table). In fact, a lot of users have hundreds of millions of different keys for some states.

Best,

Aljoscha
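
The point that one state holds many keys can be sketched with another small plain-Java model. Again, this is not Flink code: CountByEmailSketch and its methods are hypothetical names, the example.com addresses are made up, and the outer map merely stands in for the set of column families or hash tables. Under those assumptions, the number of tables depends only on how many states are registered, not on how many distinct keys arrive.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (NOT the Flink API): one table per registered state name,
// with every distinct stream key stored as a row in that table.
class CountByEmailSketch {
    // state name -> (key -> count)
    private final Map<String, Map<String, Long>> tables = new HashMap<>();

    // Count one event for the given key in the given state.
    void add(String stateName, String key) {
        tables.computeIfAbsent(stateName, n -> new HashMap<>())
              .merge(key, 1L, Long::sum);
    }

    // Number of tables, i.e. number of registered states.
    int tableCount() {
        return tables.size();
    }

    // Number of distinct keys (rows) held by one state.
    int keyCount(String stateName) {
        Map<String, Long> table = tables.get(stateName);
        return table == null ? 0 : table.size();
    }

    // Feed one event per distinct email into a single state.
    static CountByEmailSketch run(int distinctEmails) {
        CountByEmailSketch sketch = new CountByEmailSketch();
        for (int i = 0; i < distinctEmails; i++) {
            sketch.add("count by email id", "user" + i + "@example.com");
        }
        return sketch;
    }

    public static void main(String[] args) {
        CountByEmailSketch sketch = run(100_000);
        System.out.println("tables: " + sketch.tableCount());
        System.out.println("keys:   " + sketch.keyCount("count by email id"));
    }
}
```

Raising the argument of run changes only the number of rows; the table count stays at one for as long as a single state name is used.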

shashank734

Re: Can I use a lot of keyed states, or should I use one big keyed state?

Thanks, Aljoscha and Stephan, for clearing up my doubt.
