[PyFlink] register udf functions with different versions of the same library in the same job


[PyFlink] register udf functions with different versions of the same library in the same job

Rinat
Hi mates !

I've just read an amazing article about PyFlink and I'm absolutely delighted.
I have some questions about UDF registration: it seems that it's possible to specify the list of libraries that should be used to evaluate UDF functions.

As far as I understand, each UDF function runs in a separate process that is managed by Beam (but I'm not sure I got that right).
Does that mean I can register multiple UDF functions with different versions of the same library, or, even better, with different Python environments, and they won't clash?

A few words about the task that I'm trying to solve: I would like to build a recommendation pipeline that accumulates features as a table and makes recommendations using models from the MLflow registry. Since I don't want to limit data analysts in which libraries they can use, the best solution for me is to assemble the environment from a conda descriptor and register a UDF function.
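
Roughly, this is what I have in mind (just a sketch, assuming the PyFlink 1.11 Table API; py_env.zip would be a conda environment packed with something like conda-pack):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    table_env = TableEnvironment.create(
        EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build())

    # Ship the packed conda environment to the cluster with the job ...
    table_env.add_python_archive("py_env.zip")

    # ... and point the Python workers at the interpreter inside it.
    table_env.get_config().set_python_executable("py_env.zip/py_env/bin/python")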

Kubernetes and Kubeflow are not an option for us yet, so we are trying to include models into existing pipelines.

thx !




Re: [PyFlink] register udf functions with different versions of the same library in the same job

Xingbo Huang
Hi,
I'll do my best to answer the PyFlink-related questions; I hope it helps.

>>> each UDF function runs in a separate process that is managed by Beam (but I'm not sure I got that right).

Strictly speaking, it is not true that every UDF runs in its own Python process. For example, in an expression such as udf1(udf2(a)), the two Python functions udf1 and udf2 run in the same Python process; you can think of the chain as a single wrapped Python function udf1(udf2(a)). In fact, in most cases we put multiple Python UDFs together in one process to improve performance.
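
To illustrate (a minimal sketch, assuming the PyFlink 1.11 Table API; table_env and a table my_table with a BIGINT column a are assumed to already exist):

    from pyflink.table import DataTypes
    from pyflink.table.udf import udf

    @udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
    def times_two(x):
        return x * 2

    @udf(input_types=[DataTypes.BIGINT()], result_type=DataTypes.BIGINT())
    def plus_one(x):
        return x + 1

    table_env.register_function("times_two", times_two)
    table_env.register_function("plus_one", plus_one)

    # Both UDFs in this chained expression run inside the same Python
    # worker process, not one process per UDF:
    result = my_table.select("plus_one(times_two(a))")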

>>> Does that mean I can register multiple UDF functions with different versions of the same library, or, even better, with different Python environments, and they won't clash?

Currently, all nodes of a PyFlink job use the same Python environment path, so there is no way to make each UDF use a different Python execution environment. You may need to split the work across multiple jobs to achieve this effect.
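
In other words, the environment settings are job-scoped, not UDF-scoped (a minimal sketch, assuming the PyFlink 1.11 API; requirements.txt is a hypothetical pip requirements file):

    # Applies to every Python UDF in this job; there is no per-UDF override.
    table_env.set_python_requirements("requirements.txt")

    # The same job-wide scope applies to a packed environment:
    # table_env.add_python_archive("py_env.zip")
    # table_env.get_config().set_python_executable("py_env.zip/py_env/bin/python")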

Best,
Xingbo





Re: [PyFlink] register udf functions with different versions of the same library in the same job

Rinat

Hi Xingbo ! 

Thx a lot for such a detailed reply; it's very useful.

