(DEPRECATED) Apache Flink User Mailing List archive.

Broadcast state vs data enrichment

Manas Kale

Broadcast state vs data enrichment

Hi,

I have a single broadcast message that contains configuration data consumed by different operators. For example:

config = {
    "config1" : 1,
    "config2" : 2,
    "config3" : 3
}

Operator 1 will consume only config1, operator 2 only config2, and so on.

Right now, in my implementation, the config message gets broadcast to operators 1, 2, and 3, and each operator only stores what it needs.

A different approach would be to broadcast the config message to a single root operator. This will then enrich event data flowing through it with config1, config2, and config3, and each downstream operator will "strip off" the config parameter that it needs.

I was wondering which approach would be best performance-wise. I don't really have the time to implement both and compare, so perhaps someone here already knows whether one approach is better or both perform similarly.

FWIW, the config stream is very sporadic compared to the event stream.

Thank you,

Manas Kale

r_khachatryan

Re: Broadcast state vs data enrichment

Hi Manas,

The approaches you described look the same:

> each operator only stores what it needs.

> each downstream operator will "strip off" the config parameter that it needs.

Can you please explain the difference?

Regards,
Roman

On Mon, May 11, 2020 at 8:07 AM Manas Kale <[hidden email]> wrote:

> [...]

Manas Kale

Re: Broadcast state vs data enrichment

Sure. Apologies for not making this clear enough.

> each operator only stores what it needs.

Let's imagine this setup:

BROADCAST STREAM
config-stream --------------------------------------------------------------------
                            |                           |                      |
event-stream----------> operator1------------------> operator2-------------> operator3

In this scenario, all 3 operators will be BroadcastProcessFunctions. Each of them will receive the whole config message in their processBroadcastElement method, but each one will only store what it needs in their state store. So even though operator1 will receive

config = {
    "config1" : 1,
    "config2" : 2,
    "config3" : 3
}

it will only store config1.
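
A minimal sketch of this first setup, written in plain Python rather than against the Flink API. The class `KeepOwnKeyOperator`, its method names, and the event shape are hypothetical; in a real job the two methods would roughly correspond to `processBroadcastElement` and `processElement` of a BroadcastProcessFunction, and the stored value would live in the operator's state rather than in an instance attribute.

```python
from typing import Any, Dict, Optional


class KeepOwnKeyOperator:
    """Toy stand-in for one operator in the first setup (not Flink code)."""

    def __init__(self, wanted_key: str) -> None:
        self.wanted_key = wanted_key
        self.stored: Optional[int] = None  # stands in for the operator's state

    def on_config(self, config: Dict[str, int]) -> None:
        # The whole config message arrives, but only the relevant entry is kept.
        self.stored = config.get(self.wanted_key, self.stored)

    def on_event(self, event: Dict[str, Any]) -> Dict[str, Any]:
        # The event carries no config; the value is read from local state.
        if self.stored is None:
            return event  # no config received yet, pass through unchanged
        return {**event, "value": event["value"] + self.stored}


config = {"config1": 1, "config2": 2, "config3": 3}
operators = [KeepOwnKeyOperator(f"config{i}") for i in (1, 2, 3)]
for op in operators:
    op.on_config(config)  # broadcast: every operator sees the full message

event = {"id": "e1", "value": 10}  # hypothetical event
for op in operators:
    event = op.on_event(event)

print([op.stored for op in operators])  # each operator kept only its own value
print(event)
```

The point of the sketch is that the events flowing between operators never grow: the config is delivered once per change and the per-event payload stays the same.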

> each downstream operator will "strip off" the config parameter that it needs.

BROADCAST STREAM
config-stream -----------------
                              |
event-stream---------->  enricher --------------> operator1------------------> operator2-------------> operator3

In this case, the enricher operator will store the whole config message. When an event message arrives, this operator will append config1, config2 and config3 to it. Operator 1 will extract and use config1, and output a message that has config1 stripped off.
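
The same kind of plain-Python sketch for this second setup, again not Flink code. `Enricher`, `strip_and_use`, and the event shape are hypothetical names chosen for illustration, and the sketch assumes the config message has arrived before the first event.

```python
from typing import Any, Dict


class Enricher:
    """Toy stand-in for the enricher operator (not Flink code)."""

    def __init__(self) -> None:
        self.config: Dict[str, int] = {}  # the whole config message is stored

    def on_config(self, config: Dict[str, int]) -> None:
        self.config = dict(config)

    def on_event(self, event: Dict[str, Any]) -> Dict[str, Any]:
        # Every event gets a copy of all config parameters attached.
        return {**event, **self.config}


def strip_and_use(event: Dict[str, Any], key: str) -> Dict[str, Any]:
    """Downstream operator: use its own parameter, then drop it from the event."""
    remaining = {k: v for k, v in event.items() if k != key}
    remaining["value"] = event["value"] + event[key]
    return remaining


enricher = Enricher()
enricher.on_config({"config1": 1, "config2": 2, "config3": 3})

enriched = enricher.on_event({"id": "e1", "value": 10})
after_op1 = strip_and_use(enriched, "config1")
after_op2 = strip_and_use(after_op1, "config2")
after_op3 = strip_and_use(after_op2, "config3")

print(enriched)   # carries config1, config2 and config3
print(after_op1)  # config1 has been stripped off
print(after_op3)  # no config left on the event
```

Here the event is largest right after the enricher and shrinks by one parameter at each operator, so the extra payload is paid on every event rather than once per config change.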

I hope that helps!

Perhaps I am being too pedantic, but I would like to know whether these two methods have comparable performance and, if not, which one would be preferred.

On Mon, May 11, 2020 at 11:46 PM Khachatryan Roman <[hidden email]> wrote:

> [...]

r_khachatryan

Re: Broadcast state vs data enrichment

Thanks for the clarification.

Apparently, the second option (with enricher) creates more load by adding configuration to every event. Unless events are much bigger than the configuration, this will significantly increase network, memory, and CPU usage.
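
As a rough illustration of that per-event overhead, the following Python sketch compares the encoded size of a plain event with an enriched one. The event shape is hypothetical, and compact JSON is used only as a stand-in encoding: Flink uses its own serializers, so the absolute numbers will differ in a real job, but the relationship between event size and config size is what matters.

```python
import json

config = {"config1": 1, "config2": 2, "config3": 3}
event = {"id": "e1", "value": 10}  # hypothetical event shape


def json_size(record: dict) -> int:
    """Size in bytes of the record encoded as compact JSON."""
    return len(json.dumps(record, separators=(",", ":")).encode("utf-8"))


plain_size = json_size(event)
enriched_size = json_size({**event, **config})
overhead = enriched_size - plain_size

print(f"plain event:     {plain_size} bytes")
print(f"enriched event:  {enriched_size} bytes")
print(f"extra per event: {overhead} bytes ({overhead / plain_size:.0%} of the plain size)")
```

With a small event like this one, the attached config more than doubles the encoded size. With a sporadic config stream, the broadcast variant pays the config cost only when the config changes, while the enricher variant pays it on every event.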

By the way, I think you don't need a broadcast in the second option, since the interested subtask will receive the configuration anyway.

Regards,
Roman

On Tue, May 12, 2020 at 5:57 AM Manas Kale <[hidden email]> wrote:

> [...]

Manas Kale

Re: Broadcast state vs data enrichment

I see, thank you Roman!

On Tue, May 12, 2020 at 4:59 PM Khachatryan Roman <[hidden email]> wrote:

> [...]