Broadcast state

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Broadcast state

Navneeth Krishnan
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Oytun Tez
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Oytun Tez
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Navneeth Krishnan
Thanks Oytun. The problem with doing that is the same data will be have to be stored multiple times wasting memory. In my case there will around million entries which needs to be used by at least two operators for now. 

Thanks

On Mon, Sep 30, 2019 at 5:42 PM Oytun Tez <[hidden email]> wrote:
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Congxian Qiu
Hi,

Could you use some cache system such as HBase or Reids to storage this data, and query from the cache if needed?

Best,
Congxian


Navneeth Krishnan <[hidden email]> 于2019年10月1日周二 上午10:15写道:
Thanks Oytun. The problem with doing that is the same data will be have to be stored multiple times wasting memory. In my case there will around million entries which needs to be used by at least two operators for now. 

Thanks

On Mon, Sep 30, 2019 at 5:42 PM Oytun Tez <[hidden email]> wrote:
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Navneeth Krishnan
Hi,

I can use redis but I’m still having hard time figuring out how I can eliminate duplicate data. Today without broadcast state in 1.4 I’m using cache to lazy load the data. I thought the broadcast state will be similar to that of kafka streams where I have read access to the state across the pipeline. That will indeed solve a lot of problems. Is there some way I can do the same with flink?

Thanks!

On Mon, Sep 30, 2019 at 10:36 PM Congxian Qiu <[hidden email]> wrote:
Hi,

Could you use some cache system such as HBase or Reids to storage this data, and query from the cache if needed?

Best,
Congxian


Navneeth Krishnan <[hidden email]> 于2019年10月1日周二 上午10:15写道:
Thanks Oytun. The problem with doing that is the same data will be have to be stored multiple times wasting memory. In my case there will around million entries which needs to be used by at least two operators for now. 

Thanks

On Mon, Sep 30, 2019 at 5:42 PM Oytun Tez <[hidden email]> wrote:
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Fabian Hueske-2
Hi,

State is always associated with a single task in Flink.
The state of a task cannot be accessed by other tasks of the same operator or tasks of other operators.
This is true for every type of state, including broadcast state.

Best, Fabian


Am Di., 1. Okt. 2019 um 08:22 Uhr schrieb Navneeth Krishnan <[hidden email]>:
Hi,

I can use redis but I’m still having hard time figuring out how I can eliminate duplicate data. Today without broadcast state in 1.4 I’m using cache to lazy load the data. I thought the broadcast state will be similar to that of kafka streams where I have read access to the state across the pipeline. That will indeed solve a lot of problems. Is there some way I can do the same with flink?

Thanks!

On Mon, Sep 30, 2019 at 10:36 PM Congxian Qiu <[hidden email]> wrote:
Hi,

Could you use some cache system such as HBase or Reids to storage this data, and query from the cache if needed?

Best,
Congxian


Navneeth Krishnan <[hidden email]> 于2019年10月1日周二 上午10:15写道:
Thanks Oytun. The problem with doing that is the same data will be have to be stored multiple times wasting memory. In my case there will around million entries which needs to be used by at least two operators for now. 

Thanks

On Mon, Sep 30, 2019 at 5:42 PM Oytun Tez <[hidden email]> wrote:
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Congxian Qiu
Hi,

After using Redis, why there need to care about eliminate duplicated data, if you specify the same key, then Redis will do the deduplicate things.

Best,
Congxian


Fabian Hueske <[hidden email]> 于2019年10月2日周三 下午5:30写道:
Hi,

State is always associated with a single task in Flink.
The state of a task cannot be accessed by other tasks of the same operator or tasks of other operators.
This is true for every type of state, including broadcast state.

Best, Fabian


Am Di., 1. Okt. 2019 um 08:22 Uhr schrieb Navneeth Krishnan <[hidden email]>:
Hi,

I can use redis but I’m still having hard time figuring out how I can eliminate duplicate data. Today without broadcast state in 1.4 I’m using cache to lazy load the data. I thought the broadcast state will be similar to that of kafka streams where I have read access to the state across the pipeline. That will indeed solve a lot of problems. Is there some way I can do the same with flink?

Thanks!

On Mon, Sep 30, 2019 at 10:36 PM Congxian Qiu <[hidden email]> wrote:
Hi,

Could you use some cache system such as HBase or Reids to storage this data, and query from the cache if needed?

Best,
Congxian


Navneeth Krishnan <[hidden email]> 于2019年10月1日周二 上午10:15写道:
Thanks Oytun. The problem with doing that is the same data will be have to be stored multiple times wasting memory. In my case there will around million entries which needs to be used by at least two operators for now. 

Thanks

On Mon, Sep 30, 2019 at 5:42 PM Oytun Tez <[hidden email]> wrote:
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Navneeth Krishnan
Ya, there will not be a problem of duplicates. But what I'm trying to achieve is if there a large static state which needs to be present just one per node rather than storing it per slot that would be ideal. The reason being is that the state is quite large around 100GB of mostly static data and it is not needed at per slot level. It can be at per instance level where each slot can read from this shared memory.

Thanks

On Wed, Oct 9, 2019 at 12:13 AM Congxian Qiu <[hidden email]> wrote:
Hi,

After using Redis, why there need to care about eliminate duplicated data, if you specify the same key, then Redis will do the deduplicate things.

Best,
Congxian


Fabian Hueske <[hidden email]> 于2019年10月2日周三 下午5:30写道:
Hi,

State is always associated with a single task in Flink.
The state of a task cannot be accessed by other tasks of the same operator or tasks of other operators.
This is true for every type of state, including broadcast state.

Best, Fabian


Am Di., 1. Okt. 2019 um 08:22 Uhr schrieb Navneeth Krishnan <[hidden email]>:
Hi,

I can use redis but I’m still having hard time figuring out how I can eliminate duplicate data. Today without broadcast state in 1.4 I’m using cache to lazy load the data. I thought the broadcast state will be similar to that of kafka streams where I have read access to the state across the pipeline. That will indeed solve a lot of problems. Is there some way I can do the same with flink?

Thanks!

On Mon, Sep 30, 2019 at 10:36 PM Congxian Qiu <[hidden email]> wrote:
Hi,

Could you use some cache system such as HBase or Reids to storage this data, and query from the cache if needed?

Best,
Congxian


Navneeth Krishnan <[hidden email]> 于2019年10月1日周二 上午10:15写道:
Thanks Oytun. The problem with doing that is the same data will be have to be stored multiple times wasting memory. In my case there will around million entries which needs to be used by at least two operators for now. 

Thanks

On Mon, Sep 30, 2019 at 5:42 PM Oytun Tez <[hidden email]> wrote:
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: Broadcast state

Congxian Qiu
By using Redis, you can store all data in one job in one single Redis, no need one slot one Redis, what do you think?

Best,
Congxian


Navneeth Krishnan <[hidden email]> 于2019年10月18日周五 上午4:47写道:
Ya, there will not be a problem of duplicates. But what I'm trying to achieve is if there a large static state which needs to be present just one per node rather than storing it per slot that would be ideal. The reason being is that the state is quite large around 100GB of mostly static data and it is not needed at per slot level. It can be at per instance level where each slot can read from this shared memory.

Thanks

On Wed, Oct 9, 2019 at 12:13 AM Congxian Qiu <[hidden email]> wrote:
Hi,

After using Redis, why there need to care about eliminate duplicated data, if you specify the same key, then Redis will do the deduplicate things.

Best,
Congxian


Fabian Hueske <[hidden email]> 于2019年10月2日周三 下午5:30写道:
Hi,

State is always associated with a single task in Flink.
The state of a task cannot be accessed by other tasks of the same operator or tasks of other operators.
This is true for every type of state, including broadcast state.

Best, Fabian


Am Di., 1. Okt. 2019 um 08:22 Uhr schrieb Navneeth Krishnan <[hidden email]>:
Hi,

I can use redis but I’m still having hard time figuring out how I can eliminate duplicate data. Today without broadcast state in 1.4 I’m using cache to lazy load the data. I thought the broadcast state will be similar to that of kafka streams where I have read access to the state across the pipeline. That will indeed solve a lot of problems. Is there some way I can do the same with flink?

Thanks!

On Mon, Sep 30, 2019 at 10:36 PM Congxian Qiu <[hidden email]> wrote:
Hi,

Could you use some cache system such as HBase or Reids to storage this data, and query from the cache if needed?

Best,
Congxian


Navneeth Krishnan <[hidden email]> 于2019年10月1日周二 上午10:15写道:
Thanks Oytun. The problem with doing that is the same data will be have to be stored multiple times wasting memory. In my case there will around million entries which needs to be used by at least two operators for now. 

Thanks

On Mon, Sep 30, 2019 at 5:42 PM Oytun Tez <[hidden email]> wrote:
This is how we currently use broadcast state. Our states are re-usable (code-wise), every operator that wants to consume basically keeps the same descriptor state locally by processBroadcastElement'ing into a local state.

I am open to suggestions. I see this as a hard drawback of dataflow programming or Flink framework?



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:40 PM Oytun Tez <[hidden email]> wrote:
You can re-use the broadcasted state (along with its descriptor) that comes into your KeyedBroadcastProcessFunction, in another operator downstream. that's basically duplicating the broadcasted state whichever operator you want to use, every time.



---
Oytun Tez

M O T A W O R D
The World's Fastest Human Translation Platform.


On Mon, Sep 30, 2019 at 8:29 PM Navneeth Krishnan <[hidden email]> wrote:
Hi All,

Is it possible to access a broadcast state across the pipeline? For example, say I have a KeyedBroadcastProcessFunction which adds the incoming data to state and I have downstream operator where I need the same state as well, would I be able to just read the broadcast state with a readonly view. I know this is possible in kafka streams.

Thanks