[PyFlink] register udf functions with different versions of the same library in the same job


[PyFlink] register udf functions with different versions of the same library in the same job

Rinat
Hi mates !

I've just read an amazing article about PyFlink and I'm absolutely delighted.
I have some questions about UDF registration: it seems that it's possible to specify the list of libraries that should be used to evaluate UDF functions.

As far as I understand, each UDF function runs in a separate process that is managed by Beam (but I'm not sure I got that right).
Does that mean I can register multiple UDF functions with different versions of the same library, or, even better, with different Python environments, and they won't clash?

A few words about the task that I'm trying to solve: I would like to build a recommendation pipeline that accumulates features as a table and makes recommendations using models from the MLflow registry. Since I don't want to limit data analysts in which libraries they can use, the best solution for me is to assemble the environment from a conda descriptor and register a UDF function.
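
Roughly, this is what I have in mind (just a sketch, assuming the PyFlink 1.11 Table API; py_env.zip would be a conda environment packed with something like conda-pack):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    table_env = TableEnvironment.create(
        EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build())

    # Ship the packed conda environment to the cluster with the job ...
    table_env.add_python_archive("py_env.zip")

    # ... and point the Python workers at the interpreter inside it.
    table_env.get_config().set_python_executable("py_env.zip/py_env/bin/python")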

Kubernetes and Kubeflow are not an option for us yet, so we are trying to include models into existing pipelines.

thx !




Re: [PyFlink] register udf functions with different versions of the same library in the same job

Xingbo Huang
Hi,
I'll do my best to answer the PyFlink-related questions; I hope it helps.

>>> each UDF function runs in a separate process that is managed by Beam (but I'm not sure I got that right).

Strictly speaking, it is not true that every UDF runs in its own Python process. For example, in an expression such as udf1(udf2(a)), the two Python functions udf1 and udf2 run in the same Python process; you can think of the chain as a single wrapped Python function udf1(udf2(a)). In fact, in most cases we put multiple Python UDFs together in one process to improve performance.
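
To illustrate (a minimal sketch, assuming the PyFlink 1.11 Table API; table_env and a table my_table with a BIGINT column a are assumed to already exist):

    from pyflink.table import DataTypes
    from pyflink.table.udf import udf

    @udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
    def times_two(x):
        return x * 2

    @udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
    def plus_one(x):
        return x + 1

    table_env.register_function("times_two", times_two)
    table_env.register_function("plus_one", plus_one)

    # Both UDFs in this chained expression run inside the same Python
    # worker process, not one process per UDF:
    result = my_table.select("plus_one(times_two(a))")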

>>> Does that mean I can register multiple UDF functions with different versions of the same library, or, even better, with different Python environments, and they won't clash?

Currently, all nodes of a PyFlink job use the same Python environment path, so there is no way to make each UDF use a different Python execution environment. You may need to split the work across multiple jobs to achieve this effect.
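
In other words, the environment settings are job-scoped, not UDF-scoped (a minimal sketch, assuming the PyFlink 1.11 API; requirements.txt is a hypothetical pip requirements file):

    # Applies to every Python UDF in this job; there is no per-UDF override.
    table_env.set_python_requirements("requirements.txt")

    # The same job-wide scope applies to a packed environment:
    # table_env.add_python_archive("py_env.zip")
    # table_env.get_config().set_python_executable("py_env.zip/py_env/bin/python")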

Best,
Xingbo





Re: [PyFlink] register udf functions with different versions of the same library in the same job

Rinat

Hi Xingbo ! 

Thx a lot for such a detailed reply; it's very useful.

