Best practice for packaging and deploying Flink jobs on K8S


Sumeet Malhotra
Hi,

I have a PyFlink job that consists of:
  • Multiple Python files.
  • Multiple third-party Python dependencies, specified in a `requirements.txt` file.
  • A few Java dependencies, mainly for external connectors.
  • An overall job config YAML file.
Here's a simplified structure of the code layout.

flink/
├── deps
│   ├── jar
│   │   ├── flink-connector-kafka_2.11-1.12.2.jar
│   │   └── kafka-clients-2.4.1.jar
│   └── pip
│       └── requirements.txt
├── conf
│   └── job.yaml
└── job
    ├── some_file_x.py
    ├── some_file_y.py
    └── main.py

I'm able to execute this job locally, i.e., by invoking something like:

python main.py --config <path_to_job_yaml>

I'm loading the jars inside the Python code, using env.add_jars(...).
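
Concretely, the relevant bit in main.py looks roughly like this (a minimal sketch; the file:// URLs are placeholders for the absolute paths on my machine):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# add_jars() expects URLs rather than bare paths, so the connector jars
# from deps/jar are referenced via file:// URLs.
env.add_jars(
    "file:///absolute/path/to/flink/deps/jar/flink-connector-kafka_2.11-1.12.2.jar",
    "file:///absolute/path/to/flink/deps/jar/kafka-clients-2.4.1.jar",
)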

Now, the next step is to submit this job to a Flink cluster running on K8S. I'm looking for any best practices in packaging and specifying dependencies that people tend to follow. As per the documentation here [1], the various Python files, including the conf YAML, can be specified using the --pyFiles option, and Java dependencies can be specified using the --jarfile option.
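
For example, I'd imagine a submission against a session cluster on K8S would look roughly like this (an untested sketch pieced together from [1]; the cluster id and paths are placeholders):

./bin/flink run \
    --target kubernetes-session \
    -Dkubernetes.cluster-id=<my-cluster-id> \
    --python job/main.py \
    --pyFiles job/,conf/job.yaml \
    --jarfile deps/jar/flink-connector-kafka_2.11-1.12.2.jar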

So, how can I specify third-party Python package dependencies? According to another piece of documentation here [2], I should be able to specify the requirements.txt directly inside the code and submit it via the --pyFiles option. Is that right?

Are there any other best practices folks use to package/submit jobs?

Thanks,
Sumeet


Re: Best practice for packaging and deploying Flink jobs on K8S

Till Rohrmann
Hi Sumeet,

Is there a problem with the documented approaches for submitting the Python program (i.e., something not working), or are you asking in general? Going by the documentation, I would assume that you can configure the requirements.txt via `set_python_requirements`.

I am also pulling in Dian, who might be able to tell you more about the Python deployment options.

If you are not running a session cluster, you can also create a K8s image that contains your user code. That way you ship your job when deploying the cluster.

Cheers,
Till


Re: Best practice for packaging and deploying Flink jobs on K8S

Sumeet Malhotra
Hi Till,

There’s no problem with the documented approach. I was just asking whether there are any standardized ways of organizing, packaging, and deploying Python code on a Flink cluster.

Thanks,
Sumeet




Re: Best practice for packaging and deploying Flink jobs on K8S

Till Rohrmann
Alright. Then let's see what Dian recommends.

Cheers,
Till


Re: Best practice for packaging and deploying Flink jobs on K8S

Dian Fu
In reply to this post by Sumeet Malhotra
Hi Sumeet,

For the Python dependencies, multiple ways are provided to specify them, and you can use whichever fits your setup.

Regarding requirements.txt, there are three ways to specify it, and you can use any one of them:
- API inside the code: set_python_requirements
- command line option: -pyreq [1]
- configuration: python.requirements

So you don't need to specify it both inside the code and via the command line options.
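
For example, the API way looks roughly like this (a minimal sketch using the Table API; the requirements path just follows the layout from your mail):

from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = TableEnvironment.create(env_settings)

# The requirements file is shipped with the job, and the listed packages are
# installed on the workers before the Python code runs. An optional second
# argument can point at a directory of pre-downloaded packages for clusters
# without internet access.
t_env.set_python_requirements("deps/pip/requirements.txt")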

PS: It seems that -pyreq is missing from the latest CLI documentation; however, the option is actually still there, and you can refer to the 1.11 documentation for now. I'll try to add it back ASAP.


Regards,
Dian


Re: Best practice for packaging and deploying Flink jobs on K8S

Dian Fu
Hi Sumeet,

FYI: the documentation about the CLI options of PyFlink has already been updated [1].

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/cli.html#submitting-pyflink-jobs

Regards,
Dian



Re: Best practice for packaging and deploying Flink jobs on K8S

Sumeet Malhotra
Thanks for updating the documentation, Dian. Appreciate it.

..Sumeet
