Hi Flinkers,
I'd like to know whether it's better to perform a filter.project or a flatMap to filter tuples and do some projection after the filter. Functionally they are equivalent but maybe I'm ignoring something.. Thanks in advance, Flavio |
filter + project is easier to understand for the system, as the number
of output tuples is guaranteed to be <= the number of input tuples. With flatMap, the system cannot know an upper bound. --sebastian On 04.05.2015 14:43, Flavio Pompermaier wrote: > Hi Flinkers, > > I'd like to know whether it's better to perform a filter.project or a > flatMap to filter tuples and do some projection after the filter. > Functionally they are equivalent but maybe I'm ignoring something.. > > Thanks in advance, > Flavio |
In reply to this post by Flavio Pompermaier
It should not make a difference. I think its just personal taste. If your filter condition is simple enough, I'd go with Flink's Table API because it does not require to define a Filter or FlatMapFunction. 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email]>:
|
Hah, interesting to see how opinions differ ;-) Sebastian has a point, that filter + project is more transparent to the system. In some situations, this knowledge can help the optimizer, but often, it will not matter. Greetings, Stephan On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:
|
In reply to this post by Fabian Hueske-2
Thanks Sebastian and Fabian for the feedback, just one last question:
what does change from the system point of view to know that the output tuples is <= the number of input tuples? Is there any optimization that Flink can apply to the pipeline? On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email]> wrote:
|
If the system has to decide data shipping strategies for a join (e.g.,
broadcasting one side) it helps to have good estimates of the input sizes. On 04.05.2015 14:53, Flavio Pompermaier wrote: > Thanks Sebastian and Fabian for the feedback, just one last question: > what does change from the system point of view to know that the output > tuples is <= the number of input tuples? > Is there any optimization that Flink can apply to the pipeline? > > On Mon, May 4, 2015 at 2:49 PM, Fabian Hueske <[hidden email] > <mailto:[hidden email]>> wrote: > > It should not make a difference. I think its just personal taste. > > If your filter condition is simple enough, I'd go with Flink's Table > API because it does not require to define a Filter or FlatMapFunction. > > > 2015-05-04 14:43 GMT+02:00 Flavio Pompermaier <[hidden email] > <mailto:[hidden email]>>: > > Hi Flinkers, > > I'd like to know whether it's better to perform a filter.project > or a flatMap to filter tuples and do some projection after the > filter. Functionally they are equivalent but maybe I'm ignoring > something.. > > Thanks in advance, > Flavio > > > > |
In reply to this post by Flavio Pompermaier
That might help with cardinality estimation for cost-based optimization. For example when deciding about join strategies (broadcast vs. repartition, build-side of a hash join). I think, chances are low that it makes a difference.However, as Stephan said, there are many cases where it does not make a difference, e.g. if the input cardinality of the filter (or the size of the other join input) is unknown. 2015-05-04 14:53 GMT+02:00 Flavio Pompermaier <[hidden email]>:
|
Free forum by Nabble | Edit this page |