(DEPRECATED) Apache Flink User Mailing List archive.

Splitting stream

Classic

List

Threaded

5 messages Options

Nikola Hrusov

Splitting stream

Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow:

- https://stackoverflow.com/questions/53588554/apache-flink-using-filter-or-split-to-split-a-stream

- https://stackoverflow.com/questions/61752728/how-to-get-output-of-the-values-that-are-not-matched-in-filter-function-in-apach

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards

Nikola

Arvid Heise-4

Re: Splitting stream

Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink

source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:

Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow:
- https://stackoverflow.com/questions/53588554/apache-flink-using-filter-or-split-to-split-a-stream
- https://stackoverflow.com/questions/61752728/how-to-get-output-of-the-values-that-are-not-matched-in-filter-function-in-apach

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola

Nikola Hrusov

Re: Splitting stream

Hi Arvid,

In my case it's the latter, thus I have also thought about using the filter (map is not useful in my case).

What I am not sure which is better to be used?

In what case would you split a stream with side output and in what case with filter?

Would there be any performance gain/pain based on which is used?

Regards

Nikola

On Mon, May 10, 2021 at 6:00 PM Arvid Heise <[hidden email]> wrote:

Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink
source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow:
- https://stackoverflow.com/questions/53588554/apache-flink-using-filter-or-split-to-split-a-stream
- https://stackoverflow.com/questions/61752728/how-to-get-output-of-the-values-that-are-not-matched-in-filter-function-in-apach

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola

taher koitawala-2

Re: Splitting stream

I think what your looking for is a a side output. Change the logic to a process function. What is true goes to collector false can go to a side output. Which now gives you 2 streams

On Mon, May 10, 2021, 8:14 PM Nikola Hrusov <[hidden email]> wrote:

Hi Arvid,

In my case it's the latter, thus I have also thought about using the filter (map is not useful in my case).

What I am not sure which is better to be used?
In what case would you split a stream with side output and in what case with filter?
Would there be any performance gain/pain based on which is used?

Regards
,
Nikola
<a href="tel:%28%2B45%29%2060%2054%2032%2016" value="+4560543216" style="color:rgb(17,85,204)" target="_blank" rel="noreferrer">

On Mon, May 10, 2021 at 6:00 PM Arvid Heise <[hidden email]> wrote:
Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink
source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow:
- https://stackoverflow.com/questions/53588554/apache-flink-using-filter-or-split-to-split-a-stream
- https://stackoverflow.com/questions/61752728/how-to-get-output-of-the-values-that-are-not-matched-in-filter-function-in-apach

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola

Arvid Heise-4

Re: Splitting stream

Hi Nikola,

side outputs definitively are at least as efficient as using two filters but they are also harder to implement and maintain. Do you actually have a use case where every bit of performance counts?

If so, please also check enableObjectReuse [1] and look into serialization [2].

Also if you can implement your use case with Table API/SQL (with UDFs), it will be much faster than other alternatives.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/execution_configuration/

[2] https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html

On Mon, May 10, 2021 at 4:52 PM Taher Koitawala <[hidden email]> wrote:

I think what your looking for is a a side output. Change the logic to a process function. What is true goes to collector false can go to a side output. Which now gives you 2 streams

On Mon, May 10, 2021, 8:14 PM Nikola Hrusov <[hidden email]> wrote:
Hi Arvid,

In my case it's the latter, thus I have also thought about using the filter (map is not useful in my case).

What I am not sure which is better to be used?
In what case would you split a stream with side output and in what case with filter?
Would there be any performance gain/pain based on which is used?

Regards
,
Nikola
<a href="tel:%28%2B45%29%2060%2054%2032%2016" value="+4560543216" style="color:rgb(17,85,204)" rel="noreferrer" target="_blank">

On Mon, May 10, 2021 at 6:00 PM Arvid Heise <[hidden email]> wrote:
Hi Nikola,

if you just want to apply a different user function to the records depending on the property "exist" the simplest way is to use

source -> map(if exist do this else that) -> sink

If it turns out that you want to apply a different subgraph, you can do

source -> filter(if exist) -> do this -> union -> sink
source -> filter(if not exist) -> do that -^

On Mon, May 10, 2021 at 3:07 PM Nikola Hrusov <[hidden email]> wrote:
Hi,

I am trying to find some information on what is the best way to split a stream of the same data.

For the given scenario: I have an object which has a property "exist"

I want to split the stream based on this property, do something, and afterwards join it again into a single stream.

Initial (A) -> Split stream based on exist (B) or not (C) -> union both streams (D)

I could find some similar topics on StackOverflow:
- https://stackoverflow.com/questions/53588554/apache-flink-using-filter-or-split-to-split-a-stream
- https://stackoverflow.com/questions/61752728/how-to-get-output-of-the-values-that-are-not-matched-in-filter-function-in-apach

but none of them really gives a definitive answer.

What I am thinking about is using 1) filter or 2) side output.

I know that one of the use cases of side output is that it can have different data types. That is not my case as it will be the same object going through the whole pipeline.

So both options look more or less the same to me, however I do not know the flink internals as good as I would like to as of this point.

Can some of you guys shed some light and perhaps tell me if I am mistaken in my thoughts?

Thanks.

Regards
,
Nikola