(DEPRECATED) Apache Flink User Mailing List archive.

Broadcast state before events stream consumption


Vadim Vararu

Broadcast state before events stream consumption

Hi all,

I need to use the broadcast state mechanism (https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html) for the following scenario.

I have a slow reference data stream and a fast events stream, and I want to do a kind of lookup in the reference stream for each event. The broadcast state mechanism seems to fit the scenario perfectly.

From documentation:

As an example where broadcast state can emerge as a natural fit, one can imagine a low-throughput stream containing a set of rules which we want to evaluate against all elements coming from another stream.

However, I am not sure what the correct way is to delay the consumption of the fast stream until the slow one is fully read (in the case of a file) or until a marker is emitted (in the case of some other source). Is there any way to accomplish that? It doesn't seem to be a rare use case.

Thanks, Vadim.

chiggi_dev

Re: Broadcast state before events stream consumption

Hi Vadim,

I would be interested in this too.

Presently, I have to read my lookup source in the open method and keep it in a cache. That means I cannot make use of the broadcast state until, of course, the first element arrives on the broadcast stream.

The problem with making the event stream wait is that I have no way of knowing when all the data from the lookup source has been read. In my use case there is also no possibility of putting a special marker in the data.

So preloading the data seems to be the only option right now.
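
For readers who want to picture the preloading approach, here is a minimal sketch in plain Java. It is not Flink code: the class `PreloadingEnricher`, its `Supplier`-based loader, and the `onBroadcast` method are hypothetical names that only mirror the roles of a rich function's `open` method, its per-event method, and a later broadcast update.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Flink-free model (hypothetical names): preload reference data once, then look it up per event. */
class PreloadingEnricher {
    // Hypothetical loader standing in for "read my lookup source".
    private final Supplier<Map<String, String>> lookupSource;
    private Map<String, String> cache;

    PreloadingEnricher(Supplier<Map<String, String>> lookupSource) {
        this.lookupSource = lookupSource;
    }

    /** Plays the role of open(): runs once, before any event is processed. */
    void open() {
        cache = new HashMap<>(lookupSource.get());
    }

    /** Plays the role of the per-event method: enrich the event key from the cache. */
    String map(String eventKey) {
        if (cache == null) {
            throw new IllegalStateException("open() was not called");
        }
        return eventKey + "=" + cache.getOrDefault(eventKey, "UNKNOWN");
    }

    /** A later update from the broadcast side overwrites the preloaded entry. */
    void onBroadcast(String key, String value) {
        cache.put(key, value);
    }
}
```

Because the cache is filled before the first event, no event is ever processed against empty reference data. The cost is that every parallel instance reads the lookup source itself at startup.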

Thanks,

Chirag


Konstantin Knauf-2

Re: Broadcast state before events stream consumption

Hi Chirag, Hi Vadim,

Off the top of my head, I see two options here:

* Buffer the "fast" stream inside the KeyedBroadcastProcessFunction until the relevant (whatever this means in your use case) broadcast events have arrived. Advantage: operationally easy, and events are emitted as early as possible. Disadvantage: the state size might become very large, and depending on the nature of the broadcast stream it might be hard to know when the "relevant broadcast events have arrived".

* Start your job and only consume the broadcast stream (by configuration). Once the stream is "fully processed", i.e. has caught up, take a savepoint. Finally, start the job from this savepoint with the correct "fast" stream. There is a small race condition between taking the savepoint and restarting the job, which might matter (or not) depending on your use case.
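
The buffering idea in the first option can be sketched in plain Java as follows. This is a Flink-free model, not an actual KeyedBroadcastProcessFunction: the class `BufferingJoiner` is hypothetical, its method names only echo the two callbacks of that function, and `markReady` is an assumed stand-in for whatever condition tells you that the relevant broadcast events have arrived.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Flink-free model (hypothetical names): hold back events until the reference data is ready. */
class BufferingJoiner {
    private final Map<String, String> reference = new HashMap<>(); // stands in for broadcast state
    private final List<String> buffer = new ArrayList<>();         // stands in for buffered events
    private final List<String> output = new ArrayList<>();         // stands in for the collector
    private boolean ready = false;

    /** Role of processBroadcastElement: store one reference entry. */
    void processBroadcastElement(String key, String value) {
        reference.put(key, value);
    }

    /** Assumed readiness signal: flush everything buffered so far, in arrival order. */
    void markReady() {
        ready = true;
        for (String eventKey : buffer) {
            emit(eventKey);
        }
        buffer.clear();
    }

    /** Role of processElement: buffer until ready, afterwards emit immediately. */
    void processElement(String eventKey) {
        if (ready) {
            emit(eventKey);
        } else {
            buffer.add(eventKey);
        }
    }

    private void emit(String eventKey) {
        output.add(eventKey + "=" + reference.getOrDefault(eventKey, "UNKNOWN"));
    }

    List<String> output() {
        return output;
    }

    int bufferedCount() {
        return buffer.size();
    }
}
```

In a real job the buffer would presumably live in keyed state and be flushed per key, and its size is exactly the "state size might become very large" concern named above.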

This topic is related to event-time alignment in sources, which has been actively discussed in the community in the past and we might be able to solve this in a similar way in the future.

Cheers,

Konstantin


Konstantin Knauf | Solutions Architect

+49 160 91394525

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen

chiggi_dev

Re: Broadcast state before events stream consumption

Hi Konstantin,

For the second solution, would a savepoint persist the broadcast state in the state backend? I ask because my understanding is that broadcast state is not checkpointed.

Is that correct?

Thanks,

Chirag


Konstantin Knauf-2

Re: Broadcast state before events stream consumption

Hi Chirag,

Broadcast state is checkpointed, hence the savepoint would contain it.

Best,

Konstantin


Averell

Re: Broadcast state before events stream consumption

Hi Konstantin,

The following statement appears at the end of the page
broadcast_state.html#important-considerations
<https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/broadcast_state.html#important-considerations>:
"No RocksDB state backend: Broadcast state is kept in-memory at runtime and
memory provisioning should be done accordingly. This holds for all operator
states."

I am using the RocksDB state backend, and I am confused by that statement and
yours.

Could you please help clarify?

Thanks and regards,
Averell


Dawid Wysakowicz-2

Re: Broadcast state before events stream consumption

Hi Averell,

BroadcastState is a special case of OperatorState. Operator state is
always kept in-memory at runtime (must fit into memory), no matter what
state backend you use. Nevertheless it is snapshotted and thus fault
tolerant.
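
To make the distinction concrete, here is a small plain-Java model of state that lives on the heap while the job runs but can still be snapshotted and restored. It is only an illustration: the class `InMemoryOperatorState` is hypothetical, and Java serialization is used as a stand-in, not as a description of how Flink actually writes its snapshots.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

/** Flink-free model (hypothetical names): heap-resident state with snapshot and restore. */
class InMemoryOperatorState {
    // At runtime all entries live in this heap map, so they must fit into memory.
    private HashMap<String, String> state = new HashMap<>();

    void put(String key, String value) {
        state.put(key, value);
    }

    String get(String key) {
        return state.get(key);
    }

    /** Copy the current contents into a byte array, as a stand-in for a snapshot. */
    byte[] snapshot() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Rebuild the state from a snapshot, as a stand-in for recovery after a failure. */
    @SuppressWarnings("unchecked")
    static InMemoryOperatorState restore(byte[] snapshot) {
        InMemoryOperatorState restored = new InMemoryOperatorState();
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            restored.state = (HashMap<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return restored;
    }
}
```

The point of the sketch is that "kept in memory at runtime" and "fault tolerant" are not in conflict: the first describes where the live copy sits, the second describes the copy taken at snapshot time.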

Best,

Dawid
