Question regarding Streaming Resources


Question regarding Streaming Resources

Vijay Bhaskar
Hi

I have created a KeyedStream with state, as explained below.
For example, I have created 1000 streams, of which 50% will receive data only once every 8 hours. Will the resources of these underutilized streams sit idle for that duration, or does the Flink task manager have some strategy for reusing them for other new streams as they arrive?
Regards
Bhaskar

Re: Question regarding Streaming Resources

Vijay Bhaskar


What is the best strategy I can use in this scenario?

Regards
Bhaskar

Re: Question regarding Streaming Resources

Ken Krugler
Hi Bhaskar,

I assume you don’t have 1000 streams, but rather one (keyed) stream with 1000 different key values, yes?

If so, then this one stream is physically partitioned based on the parallelism of the operator following the keyBy(), not per unique key.

The most common per-key “resource” is the memory required for each key's state, if you’ve got any operations that need to maintain state (accumulators, windows, etc.).

For 1000 unique keys, this should be negligible.
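
As a rough illustration of the partitioning described above, the following sketch spreads 1000 keys over a small, fixed number of parallel subtasks. It is a simplification, not Flink's actual code: Flink hashes each key into a key group and then maps key groups onto subtasks, whereas this sketch uses a plain CRC32 modulo. The parallelism of 4 and the key names are assumptions made for the example.

```python
import zlib
from collections import Counter

PARALLELISM = 4  # assumed parallelism of the operator after keyBy()


def subtask_for(key: str, parallelism: int = PARALLELISM) -> int:
    """Map a key to a subtask index with a stable hash (simplified stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


keys = [f"key-{i}" for i in range(1000)]  # hypothetical key values
load = Counter(subtask_for(k) for k in keys)

# The 1000 keys are shared among at most PARALLELISM subtasks.
print(sorted(load.items()))
```

However many distinct keys arrive, the number of subtasks stays at the configured parallelism; each subtask simply handles a larger share of keys.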

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra


Re: Question regarding Streaming Resources

Vijay Bhaskar


Hi Ken,
According to the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/), applying keyBy to a DataStream produces a KeyedStream. Since it is a stream, it should execute in parallel, right? There will be 1000 streams, each with its own keyed state. You are saying the main overhead here is only memory. Does that mean these 1000 streams will run across 1000 task slots in parallel, and that those 1000 task slots are the main memory overhead? And is there no harm even if 50% of them are idle?
Please clarify.
Regards
Bhaskar
 

Re: Question regarding Streaming Resources

Ken Krugler
Hi Bhaskar,

> Hi Ken,
> According to the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/), applying keyBy to a DataStream produces a KeyedStream.

Correct.

> Since it is a stream, it should execute in parallel, right?

It will operate in parallel, based on the parallelism of the downstream operations applied to the KeyedStream.

The number of unique keys has nothing to do with the number of parallel (simultaneous) operator instances processing the KeyedStream.

> There will be 1000 streams, each with its own keyed state.

“Stream” has a specific meaning in Flink, which I think you’re not using as intended here.

> You are saying the main overhead here is only memory. Does that mean these 1000 streams will run across 1000 task slots in parallel, and that those 1000 task slots are the main memory overhead? And is there no harm even if 50% of them are idle?

See above: you don’t have 1000 “task slots” and you don’t have 1000 streams.

You have N operator instances running at the same time, where N is the parallelism set (either implicitly or explicitly) for the operator(s) processing the KeyedStream.

Note that if you have 1000 unique keys and (for example) a single ValueState per key, then you’d have 1000 states.
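
A minimal sketch of why idle keys cost only memory: per-key state behaves roughly like one map entry per key, held by whichever operator instance owns that key. The dictionary below is a stand-in for a ValueState and is not Flink's API; the event shape and key names are assumptions.

```python
from typing import Dict, Iterable, Tuple


def run_counts(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Keep one running total per key, like a single ValueState per key."""
    state: Dict[str, int] = {}
    for key, value in events:
        # Only the key that received an event triggers any work.
        state[key] = state.get(key, 0) + value
    return state


# 1000 hypothetical keys: each is seen once, then only the first 500 stay busy.
events = [(f"key-{i}", 1) for i in range(1000)]
events += [(f"key-{i}", 1) for i in range(500)] * 10

state = run_counts(events)
print(len(state))        # 1000: one state entry per key
print(state["key-0"])    # 11: a busy key
print(state["key-999"])  # 1: an idle key still holds its state entry
```

The idle keys occupy an entry each but are never touched between events, which is why 500 quiet keys out of 1000 do not tie up processing resources.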

But if you have, say, a sliding window, then the number of states per key can grow significantly, since each key can have multiple states (one per “open window”).
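
To make the growth concrete: with a sliding window, each element belongs to roughly size / slide overlapping windows, so a key may hold that many window states at once. The sketch below lists the start times of the windows covering one timestamp. The 60-second size and 10-second slide are assumed values, and the helper is a hypothetical illustration rather than Flink's window assigner.

```python
from typing import List


def window_starts(timestamp: int, size: int, slide: int) -> List[int]:
    """Return start times of all sliding windows [start, start + size) containing timestamp."""
    last_start = timestamp - (timestamp % slide)  # latest window that has already started
    starts = []
    start = last_start
    while start > timestamp - size:
        starts.append(start)
        start -= slide
    return starts


starts = window_starts(timestamp=125, size=60, slide=10)
print(starts)       # [120, 110, 100, 90, 80, 70]
print(len(starts))  # 6 open windows for this one key
```

With these assumed settings, 1000 active keys could mean around 6000 window states instead of 1000.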

But also note that using RocksDB to handle state means that not all state has to be in memory at the same time, so you’ve got more room to scale.

Regards,

— Ken



Re: Question regarding Streaming Resources

Vijay Bhaskar


Thanks Ken for the detailed clarification!

Regards
Bhaskar