(DEPRECATED) Apache Flink User Mailing List archive.

filter().project() vs flatMap()

Classic

List

Threaded

7 messages Options

Flavio Pompermaier

filter().project() vs flatMap()

Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,

Flavio

Sebastian Schelter-2

Re: filter().project() vs flatMap()

filter + project is easier to understand for the system, as the number
of output tuples is guaranteed to be <= the number of input tuples. With
flatMap, the system cannot know an upper bound.

--sebastian

On 04.05.2015 14:43, Flavio Pompermaier wrote:
> Hi Flinkers,
>
> I'd like to know whether it's better to perform a filter.project or a
> flatMap to filter tuples and do some projection after the filter.
> Functionally they are equivalent but maybe I'm ignoring something..
>
> Thanks in advance,
> Flavio

Fabian Hueske-2

Re: filter().project() vs flatMap()

In reply to this post by Flavio Pompermaier

It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.

2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:

Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio

Stephan Ewen

Re: filter().project() vs flatMap()

Hah, interesting to see how opinions differ ;-)

Sebastian has a point, that filter + project is more transparent to the system. In some situations, this knowledge can help the optimizer, but often, it will not matter.

Greetings,

Stephan

On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:

It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.

2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio

Flavio Pompermaier

Re: filter().project() vs flatMap()

In reply to this post by Fabian Hueske-2

Thanks Sebastian and Fabian for the feedback, just one last question:

what does change from the system point of view to know that the output tuples is <= the number of input tuples?

Is there any optimization that Flink can apply to the pipeline?

On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:

It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.

2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio

Sebastian Schelter-2

Re: filter().project() vs flatMap()

If the system has to decide data shipping strategies for a join (e.g.,
broadcasting one side) it helps to have good estimates of the input sizes.

On 04.05.2015 14:53, Flavio Pompermaier wrote:

> Thanks Sebastian and Fabian for the feedback, just one last question:
> what does change from the system point of view to know that the output
> tuples is <= the number of input tuples?
> Is there any optimization that Flink can apply to the pipeline?
>
> On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]
> <mailto:[hidden email]>> wrote:
>
> It should not make a difference. I think its just personal taste.
>
> If your filter condition is simple enough, I'd go with Flink's Table
> API because it does not require to define a Filter or FlatMapFunction.
>
>
> 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]
> <mailto:[hidden email]>>:
>
> Hi Flinkers,
>
> I'd like to know whether it's better to perform a filter.project
> or a flatMap to filter tuples and do some projection after the
> filter. Functionally they are equivalent but maybe I'm ignoring
> something..
>
> Thanks in advance,
> Flavio
>
>
>
>

Fabian Hueske-2

Re: filter().project() vs flatMap()

In reply to this post by Flavio Pompermaier

That might help with cardinality estimation for cost-based optimization. For example when deciding about join strategies (broadcast vs. repartition, build-side of a hash join).
However, as Stephan said, there are many cases where it does not make a difference, e.g. if the input cardinality of the filter (or the size of the other join input) is unknown.

I think, chances are low that it makes a difference.

2015-05-04 14:53 GMT+02:00 Flavio Pompermaier <[hidden email]>:

Thanks Sebastian and Fabian for the feedback, just one last question:
what does change from the system point of view to know that the output tuples is <= the number of input tuples?
Is there any optimization that Flink can apply to the pipeline?

On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:
It should not make a difference. I think its just personal taste.

If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction.

2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
Hi Flinkers,

I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something..

Thanks in advance,
Flavio