Best practice for packaging and deploying Flink jobs on K8S


Sumeet Malhotra
Hi,

I have a PyFlink job that consists of:
  • Multiple Python files.
  • Multiple third-party Python dependencies, specified in a `requirements.txt` file.
  • A few Java dependencies, mainly for external connectors.
  • An overall job config YAML file.
Here's a simplified structure of the code layout.

flink/
├── deps
│   ├── jar
│   │   ├── flink-connector-kafka_2.11-1.12.2.jar
│   │   └── kafka-clients-2.4.1.jar
│   └── pip
│       └── requirements.txt
├── conf
│   └── job.yaml
└── job
    ├── some_file_x.py
    ├── some_file_y.py
    └── main.py

I'm able to execute this job locally, i.e., by invoking something like:

python main.py --config <path_to_job_yaml>

I'm loading the jars inside the Python code, using env.add_jars(...).
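
Concretely, the relevant bit in main.py looks roughly like this (a minimal sketch; the file:// URLs are placeholders for the absolute paths on my machine):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# add_jars() expects URLs rather than bare paths, so the connector jars
# from deps/jar are referenced via file:// URLs.
env.add_jars(
    "file:///absolute/path/to/flink/deps/jar/flink-connector-kafka_2.11-1.12.2.jar",
    "file:///absolute/path/to/flink/deps/jar/kafka-clients-2.4.1.jar",
)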

Now, the next step is to submit this job to a Flink cluster running on K8S. I'm looking for any best practices in packaging and specifying dependencies that people tend to follow. As per the documentation here [1], the various Python files, including the conf YAML, can be specified using the --pyFiles option, and Java dependencies can be specified using the --jarfile option.
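
For example, I'd imagine a submission against a session cluster on K8S would look roughly like this (an untested sketch pieced together from [1]; the cluster id and paths are placeholders):

./bin/flink run \
    --target kubernetes-session \
    -Dkubernetes.cluster-id=<my-cluster-id> \
    --python job/main.py \
    --pyFiles job/,conf/job.yaml \
    --jarfile deps/jar/flink-connector-kafka_2.11-1.12.2.jar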

So, how can I specify third-party Python package dependencies? According to another piece of documentation here [2], I should be able to specify the requirements.txt directly inside the code and submit it via the --pyFiles option. Is that right?

Are there any other best practices folks use to package/submit jobs?

Thanks,
Sumeet


Re: Best practice for packaging and deploying Flink jobs on K8S

Till Rohrmann
Hi Sumeet,

Is there a problem with the documented approaches for submitting the Python program (i.e., something not working), or are you asking in general? Going by the documentation, I would assume that you can configure the requirements.txt via `set_python_requirements`.

I am also pulling in Dian, who might be able to tell you more about the Python deployment options.

If you are not running a session cluster, you can also create a K8s image that contains your user code. That way you ship your job when deploying the cluster.

Cheers,
Till


Re: Best practice for packaging and deploying Flink jobs on K8S

Sumeet Malhotra
Hi Till,

There’s no problem with the documented approach. I was just asking whether there are any standardized ways of organizing, packaging, and deploying Python code on a Flink cluster.

Thanks,
Sumeet




Re: Best practice for packaging and deploying Flink jobs on K8S

Till Rohrmann
Alright. Then let's see what Dian recommends.

Cheers,
Till


Re: Best practice for packaging and deploying Flink jobs on K8S

Dian Fu
In reply to this post by Sumeet Malhotra
Hi Sumeet,

For the Python dependencies, multiple ways are provided to specify them, and you can use whichever fits your setup.

Regarding requirements.txt, there are three ways to specify it, and you can use any one of them:
- API inside the code: set_python_requirements
- command line option: -pyreq [1]
- configuration: python.requirements

So you don't need to specify it both inside the code and via the command line options.
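
For example, the API way looks roughly like this (a minimal sketch using the Table API; the requirements path just follows the layout from your mail):

from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = TableEnvironment.create(env_settings)

# The requirements file is shipped with the job, and the listed packages are
# installed on the workers before the Python code runs. An optional second
# argument can point at a directory of pre-downloaded packages for clusters
# without internet access.
t_env.set_python_requirements("deps/pip/requirements.txt")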

PS: It seems that -pyreq is missing from the latest CLI documentation; however, the option is actually still there, and you can refer to the 1.11 documentation for now. I'll try to add it back ASAP.


Regards,
Dian


Re: Best practice for packaging and deploying Flink jobs on K8S

Dian Fu
Hi Sumeet,

FYI: the documentation about the CLI options of PyFlink has already been updated [1].

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/cli.html#submitting-pyflink-jobs

Regards,
Dian



Re: Best practice for packaging and deploying Flink jobs on K8S

Sumeet Malhotra
Thanks for updating the documentation, Dian. Appreciate it.

..Sumeet
