Splitting stream

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Splitting stream

Nikola Hrusov
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow: 

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola
Reply | Threaded
Open this post in threaded view
|

Re: Splitting stream

Arvid Heise-4
Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink
source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow: 

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola
Reply | Threaded
Open this post in threaded view
|

Re: Splitting stream

Nikola Hrusov
Hi Arvid,

In my case it's the latter, thus I have also thought about using the filter (map is not useful in my case).

What I am not sure which is better to be used?
In what case would you split a stream with side output and in what case with filter? 
Would there be any performance gain/pain based on which is used?

Regards
,
Nikola
<a href="tel:%28%2B45%29%2060%2054%2032%2016" value="+4560543216" style="color:rgb(17,85,204)" target="_blank">


On Mon, May 10, 2021 at 6:00 PM Arvid Heise <[hidden email]> wrote:
Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink
source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow: 

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola
Reply | Threaded
Open this post in threaded view
|

Re: Splitting stream

taher koitawala-2
I think what your looking for is a a side output. Change the logic to a process function. What is true goes to collector false can go to a side output. Which now gives you 2 streams

On Mon, May 10, 2021, 8:14 PM Nikola Hrusov <[hidden email]> wrote:
Hi Arvid,

In my case it's the latter, thus I have also thought about using the filter (map is not useful in my case).

What I am not sure which is better to be used?
In what case would you split a stream with side output and in what case with filter? 
Would there be any performance gain/pain based on which is used?

Regards
,
Nikola
<a href="tel:%28%2B45%29%2060%2054%2032%2016" value="+4560543216" style="color:rgb(17,85,204)" target="_blank" rel="noreferrer">


On Mon, May 10, 2021 at 6:00 PM Arvid Heise <[hidden email]> wrote:
Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink
source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow: 

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola
Reply | Threaded
Open this post in threaded view
|

Re: Splitting stream

Arvid Heise-4
Hi Nikola,

side outputs definitively are at least as efficient as using two filters but they are also harder to implement and maintain. Do you actually have a use case where every bit of performance counts?

If so, please also check enableObjectReuse [1] and look into serialization [2].

Also if you can implement your use case with Table API/SQL (with UDFs), it will be much faster than other alternatives.


On Mon, May 10, 2021 at 4:52 PM Taher Koitawala <[hidden email]> wrote:
I think what your looking for is a a side output. Change the logic to a process function. What is true goes to collector false can go to a side output. Which now gives you 2 streams

On Mon, May 10, 2021, 8:14 PM Nikola Hrusov <[hidden email]> wrote:
Hi Arvid,

In my case it's the latter, thus I have also thought about using the filter (map is not useful in my case).

What I am not sure which is better to be used?
In what case would you split a stream with side output and in what case with filter? 
Would there be any performance gain/pain based on which is used?

Regards
,
Nikola
<a href="tel:%28%2B45%29%2060%2054%2032%2016" value="+4560543216" style="color:rgb(17,85,204)" rel="noreferrer" target="_blank">


On Mon, May 10, 2021 at 6:00 PM Arvid Heise <[hidden email]> wrote:
Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink
source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow: 

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola