filter().project() vs flatMap()

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

filter().project() vs flatMap()

Flavio Pompermaier
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio
Reply | Threaded
Open this post in threaded view
|

Re: filter().project() vs flatMap()

Sebastian Schelter-2
filter + project is easier to understand for the system, as the number
of output tuples is guaranteed to be <= the number of input tuples. With
flatMap, the system cannot know an upper bound.

--sebastian

On 04.05.2015 14:43, Flavio Pompermaier wrote:
> Hi Flinkers,
>
> I'd like to know whether it's better to perform a filter.project or a
> flatMap to filter tuples and do some projection after the filter.
> Functionally they are equivalent but maybe I'm ignoring something..
>
> Thanks in advance,
> Flavio
Reply | Threaded
Open this post in threaded view
|

Re: filter().project() vs flatMap()

Fabian Hueske-2
In reply to this post by Flavio Pompermaier
It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.


2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio

Reply | Threaded
Open this post in threaded view
|

Re: filter().project() vs flatMap()

Stephan Ewen
Hah, interesting to see how opinions differ ;-)

Sebastian has a point, that filter + project is more transparent to the system. In some situations, this knowledge can help the optimizer, but often, it will not matter.

Greetings,
Stephan


On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:
It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.


2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio


Reply | Threaded
Open this post in threaded view
|

Re: filter().project() vs flatMap()

Flavio Pompermaier
In reply to this post by Fabian Hueske-2
Thanks Sebastian and Fabian for the feedback, just one last question:
what does change from the system point of view to know that the  output tuples is <= the number of input tuples?
Is there any optimization that Flink can apply to the pipeline?

On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:
It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.


2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio



Reply | Threaded
Open this post in threaded view
|

Re: filter().project() vs flatMap()

Sebastian Schelter-2
If the system has to decide data shipping strategies for a join (e.g.,
broadcasting one side) it helps to have good estimates of the input sizes.

On 04.05.2015 14:53, Flavio Pompermaier wrote:

> Thanks Sebastian and Fabian for the feedback, just one last question:
> what does change from the system point of view to know that the output
> tuples is <= the number of input tuples?
> Is there any optimization that Flink can apply to the pipeline?
>
> On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     It should not make a difference. I think its just personal taste.
>
>     If your filter condition is simple enough, I'd go with Flink's Table
>     API because it does not require to define a Filter or FlatMapFunction.
>
>
>     2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]
>     <mailto:[hidden email]>>:
>
>         Hi Flinkers,
>
>         I'd like to know whether it's better to perform a filter.project
>         or a flatMap to filter tuples and do some projection after the
>         filter. Functionally they are equivalent but maybe I'm ignoring
>         something..
>
>         Thanks in advance,
>         Flavio
>
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: filter().project() vs flatMap()

Fabian Hueske-2
In reply to this post by Flavio Pompermaier
That might help with cardinality estimation for cost-based optimization. For example when deciding about join strategies (broadcast vs. repartition, build-side of a hash join).
However, as Stephan said, there are many cases where it does not make a difference, e.g. if the input cardinality of the filter (or the size of the other join input) is unknown.

I think, chances are low that it makes a difference.


2015-05-04 14:53 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Thanks Sebastian and Fabian for the feedback, just one last question:
what does change from the system point of view to know that the  output tuples is <= the number of input tuples?
Is there any optimization that Flink can apply to the pipeline?

On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:
It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.


2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio