global state and single stream

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

global state and single stream

Adam Atrea

Hi, 

I am pretty new to Flink and I'm trying to implement - which seems to me - a pretty basic pattern:

Say I have a single stream of Price objects encapsulating a price value and a symbol (for example A to Z)  they are emitted at a very random interval all day - could be 10000 /day or once a week.

..(price= .22, symbol = 'B')...(price= .12 , symbol = 'C').. (price= .12 , symbol = 'A'). .(price= .22, symbol = 'Z')...

I want to define an operator that should emit a single object only once when ALL symbols have been received (not before) - the object will include all the received prices.
If a price is received again for the same symbol it will reemit the same object with the updated price and all previous prices will stay the same.

If the stream is keyed by symbol I understand that using MapState state will not help because the state is local to the partition - each task will need to know that all symbols have been received.

Basically I'm looking for a global state for a single keyed stream - a state accessible by all parallel tasks - I have used the broadcast pattern but I understand this is when connecting 2 streams - is there a way to do it without forcing parallelism to 1.

Thanks in advance for your assistance

Adam


Reply | Threaded
Open this post in threaded view
|

Re: global state and single stream

Matthias
Hi Adam,
sorry for the late reply. Introducing a global state is something that should be avoided as it introduces bottlenecks and/or concurrency/order issues. Broadcasting the state between different subtasks will also bring a loss in performance since each state change has to be shared with every other subtask. Ideally, you might be able to reconsider the design of your pipeline.

Is there a specific reason that prevents you from doing the merging on a single instance?

Best,
Matthias

On Wed, Sep 16, 2020 at 11:21 PM Adam Atrea <[hidden email]> wrote:

Hi, 

I am pretty new to Flink and I'm trying to implement - which seems to me - a pretty basic pattern:

Say I have a single stream of Price objects encapsulating a price value and a symbol (for example A to Z)  they are emitted at a very random interval all day - could be 10000 /day or once a week.

..(price= .22, symbol = 'B')...(price= .12 , symbol = 'C').. (price= .12 , symbol = 'A'). .(price= .22, symbol = 'Z')...

I want to define an operator that should emit a single object only once when ALL symbols have been received (not before) - the object will include all the received prices.
If a price is received again for the same symbol it will reemit the same object with the updated price and all previous prices will stay the same.

If the stream is keyed by symbol I understand that using MapState state will not help because the state is local to the partition - each task will need to know that all symbols have been received.

Basically I'm looking for a global state for a single keyed stream - a state accessible by all parallel tasks - I have used the broadcast pattern but I understand this is when connecting 2 streams - is there a way to do it without forcing parallelism to 1.

Thanks in advance for your assistance

Adam