how to query the output of the scalar table function

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

how to query the output of the scalar table function

Darshan Singh
Hi,

I am not able to find what is best way to query the output of a scalar table function.

Suppose I have table which has column col1 which is string.

I have a scalar function and returns a POJO
{col1V1 String, col1V2 String , col1V3 String}.


I am using following.

so table.select("sf(col1) as sfc")
.select("sfc.get('col1V1') as v1, sfc.get('col1V2') as v2 , sfc.get('col1V13') as v3 ")

It is working fine but strangely enough it calls the sf 3 times(each per get) for same row. However, If I use table function it calls just once.

So to me it seems that scalar function is very expensive so if I have 10 columns and the computation is expensive it will do 10 times. Thats why I thought maybe there is a better way to return pojo from scalar function rather than what I have been using.

If this is the best way then I wonder if scalar functions should be used to return single value only.

Thanks
Reply | Threaded
Open this post in threaded view
|

Re: how to query the output of the scalar table function

Fabian Hueske-2
Hi Darshan,

What you observe is the result of what's supposed to be an optimization. By fusing the two select() calls, we reduce the number of operators in the resulting plan (one MapFunction less).
This optimization is only applied for ScalarFunctions but not for TableFunctions.
With a better cost-modeling that increases the cost of user-defined ScalarFunctions it should be possible to "convince" the optimizer to not fuse the operators.

For now, you could either use the TableFunction approach or convert the result of the first select() into a DataStream (or DataSet depending on your setup) and register that again as a table.
That would split the plan into two parts which are independently optimized and hence the select() operators would not be merged.

Best, Fabian



2018-03-30 22:07 GMT+02:00 Darshan Singh <[hidden email]>:
Hi,

I am not able to find what is best way to query the output of a scalar table function.

Suppose I have table which has column col1 which is string.

I have a scalar function and returns a POJO
{col1V1 String, col1V2 String , col1V3 String}.


I am using following.

so table.select("sf(col1) as sfc")
.select("sfc.get('col1V1') as v1, sfc.get('col1V2') as v2 , sfc.get('col1V13') as v3 ")

It is working fine but strangely enough it calls the sf 3 times(each per get) for same row. However, If I use table function it calls just once.

So to me it seems that scalar function is very expensive so if I have 10 columns and the computation is expensive it will do 10 times. Thats why I thought maybe there is a better way to return pojo from scalar function rather than what I have been using.

If this is the best way then I wonder if scalar functions should be used to return single value only.

Thanks

Reply | Threaded
Open this post in threaded view
|

Re: how to query the output of the scalar table function

Darshan Singh
Thanks Fabian

We are going to replace all scalar functions with the table functions.

Thanks

On Wed, Apr 4, 2018 at 12:16 PM, Fabian Hueske <[hidden email]> wrote:
Hi Darshan,

What you observe is the result of what's supposed to be an optimization. By fusing the two select() calls, we reduce the number of operators in the resulting plan (one MapFunction less).
This optimization is only applied for ScalarFunctions but not for TableFunctions.
With a better cost-modeling that increases the cost of user-defined ScalarFunctions it should be possible to "convince" the optimizer to not fuse the operators.

For now, you could either use the TableFunction approach or convert the result of the first select() into a DataStream (or DataSet depending on your setup) and register that again as a table.
That would split the plan into two parts which are independently optimized and hence the select() operators would not be merged.

Best, Fabian



2018-03-30 22:07 GMT+02:00 Darshan Singh <[hidden email]>:
Hi,

I am not able to find what is best way to query the output of a scalar table function.

Suppose I have table which has column col1 which is string.

I have a scalar function and returns a POJO
{col1V1 String, col1V2 String , col1V3 String}.


I am using following.

so table.select("sf(col1) as sfc")
.select("sfc.get('col1V1') as v1, sfc.get('col1V2') as v2 , sfc.get('col1V13') as v3 ")

It is working fine but strangely enough it calls the sf 3 times(each per get) for same row. However, If I use table function it calls just once.

So to me it seems that scalar function is very expensive so if I have 10 columns and the computation is expensive it will do 10 times. Thats why I thought maybe there is a better way to return pojo from scalar function rather than what I have been using.

If this is the best way then I wonder if scalar functions should be used to return single value only.

Thanks