Efficiency for Filter then Transform ( filter().map() vs flatMap() )


tambunanw
Hi All, 

I would like to filter some items from the event stream. I think there are two ways of doing this.

One is the regular pipeline filter(...).map(...); the other is to use flatMap to do both in the same operator.

Is there any performance improvement if we use flatMap, since everything would then be done in one operator instance?
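To make the two shapes concrete, here is a minimal sketch of the equivalence in plain java.util.stream (not Flink's DataStream API; the even-number predicate and the *10 transform are made-up examples):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FilterMapVsFlatMap {
    public static void main(String[] args) {
        List<Integer> events = List.of(1, 2, 3, 4, 5, 6);

        // Variant 1: filter then map as two chained steps.
        List<Integer> chained = events.stream()
                .filter(x -> x % 2 == 0)   // keep even events
                .map(x -> x * 10)          // transform the survivors
                .collect(Collectors.toList());

        // Variant 2: a single flatMap that emits zero or one element per input.
        List<Integer> fused = events.stream()
                .flatMap(x -> x % 2 == 0 ? Stream.of(x * 10) : Stream.<Integer>empty())
                .collect(Collectors.toList());

        System.out.println(chained); // [20, 40, 60]
        System.out.println(fused);   // [20, 40, 60]
    }
}
```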


Cheers

Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

Gyula Fóra
Hey Welly,

If you call filter and map one after the other like you mentioned, these operators will be chained and executed as if they were running in the same operator.
The only small performance overhead comes from the fact that the output of the filter will be copied before passing it as input to the map to keep immutability guarantees (but no serialization/deserialization will happen). Copying might be practically free depending on your data type, though.

If you are using operators that don't make use of the immutability of inputs/outputs (i.e. you don't hold references to those values), then you can disable copying altogether by calling env.getConfig().enableObjectReuse(), in which case the two variants will have exactly the same performance.
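For reference, a sketch of what that looks like (Flink's DataStream API as discussed in this thread; the integer pipeline is illustrative, and the code requires the Flink streaming dependency on the classpath, so treat it as a configuration fragment rather than a standalone program):

```java
// Illustrative only: assumes Flink's StreamExecutionEnvironment is available.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Safe only if no user function holds references to input/output objects:
env.getConfig().enableObjectReuse();

env.fromElements(1, 2, 3, 4, 5, 6)
   .filter(x -> x % 2 == 0)  // chained with the map below into one task
   .map(x -> x * 10)
   .print();

env.execute("filter-then-map");
```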

Cheers,
Gyula


Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

tambunanw
Hi Gyula, 

Thanks for your response. It seems I will use filter and map for now, as that makes the intention clear and is not a big performance hit.

Thanks again. 

Cheers


Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

Stephan Ewen
In a set of benchmarks a while back, we found that the chaining mechanism has some overhead right now, because of its abstraction. The abstraction creates iterators for each element and makes it hard for the JIT to specialize on the operators in the chain.

For purely local chains at full speed, this overhead is observable (it can decrease throughput from 25 million elements per core to 15-20 million elements per core). If your job does not reach that throughput, or is I/O bound, source bound, etc., it does not matter.

If you care about super high performance, collapsing the code into one function helps.
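Collapsing the two steps into one user function can be sketched like this (a plain-Java stand-in, not Flink's actual FlatMapFunction interface; a Consumer plays the role of Flink's Collector, and the predicate/transform are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class FusedFilterMap {
    // One fused function doing filter and transform in a single call,
    // emitting zero or one output per input (flatMap-style).
    static void fused(int event, Consumer<Integer> out) {
        if (event % 2 == 0) {       // the filter step
            out.accept(event * 10); // the map step
        }
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>();
        for (int e : new int[] {1, 2, 3, 4, 5, 6}) {
            fused(e, results::add);
        }
        System.out.println(results); // [20, 40, 60]
    }
}
```

Because filter and transform now live in one method body, the runtime sees a single call per element instead of the per-element iterator machinery of a chain.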


Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

tambunanw
Hi Stephan, 

That's good information to know. We will hit that throughput easily; our computation graph has a lot of chaining like this right now.
I think it's safe to minimize the chaining for now.

Thanks a lot for this Stephan. 

Cheers


Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

Stephan Ewen
We will definitely also try to get the chaining overhead down a bit.

BTW: To reach this kind of throughput, you need sources that can produce data very fast.


Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

tambunanw
Hi Stephan, 

Cheers


Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

tambunanw
Hi Stephan, 

Thanks for the clarification.

Basically we will have lots of sensors pushing this kind of data to a queuing system (currently we are using RabbitMQ, but we will soon move to Kafka).
We will also use the same pipeline to process the historical data.

I also want to minimize the chaining, since the filter is doing very little work. By minimizing the pipeline we can reduce db/external-source hits and cache local data efficiently.

Cheers
