Fwd: Flink Python API and HADOOP_CLASSPATH


Fwd: Flink Python API and HADOOP_CLASSPATH

Eduard Tudenhoefner
Hello,

I was wondering whether anyone has tried and/or had any luck creating a custom catalog with Iceberg + Flink via the Python API (https://iceberg.apache.org/flink/#custom-catalog)? 

When doing so, the docs mention that dependencies need to be specified via pipeline.jars / pipeline.classpaths (https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/python/dependency_management/). I'm providing the Iceberg Flink runtime + the Hadoop libs via those, and I can confirm that execution reaches the FlinkCatalogFactory, but then things fail because the Hadoop dependencies can't be found for some reason.
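Roughly, the setup looks like the sketch below (the jar paths and catalog options are placeholders, not the exact values from my gist):

from pyflink.table import EnvironmentSettings, TableEnvironment

env_settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
t_env = TableEnvironment.create(env_settings)

# Dependencies are passed via pipeline.jars (semicolon-separated file URLs).
t_env.get_config().get_configuration().set_string(
    "pipeline.jars",
    "file:///path/to/iceberg-flink-runtime.jar;file:///path/to/hadoop-deps.jar")

# Custom Iceberg catalog, as described at https://iceberg.apache.org/flink/#custom-catalog
t_env.execute_sql("""
    CREATE CATALOG my_catalog WITH (
        'type'='iceberg',
        'catalog-type'='hadoop',
        'warehouse'='file:///tmp/iceberg/warehouse'
    )
""")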

What would be the right way to provide the HADOOP_CLASSPATH when using the Python API? I have a minimal code example that shows the issue here: https://gist.github.com/nastra/92bc3bc7b7037d956aa5807988078b8d#file-flink-py-L38 

Thanks

Re: Flink Python API and HADOOP_CLASSPATH

Dian Fu
Hi,

1) The cause of the exception:
The dependencies added via pipeline.jars / pipeline.classpaths are used to construct the user classloader.

For your job, the exception happens when HadoopUtils.getHadoopConfiguration is called. HadoopUtils is provided by Flink and is therefore loaded by the system classloader instead of the user classloader. Since the Hadoop dependencies are only available on the user classloader, a ClassNotFoundException is raised.

So the Hadoop dependencies are a little special compared to the other dependencies: Flink itself depends on them, not only the user code.

2) How to support HADOOP_CLASSPATH in local mode (e.g. executing in an IDE)

Currently, you have to copy the Hadoop jars into the PyFlink installation directory: site-packages/pyflink/lib/
However, it makes sense to support the HADOOP_CLASSPATH environment variable in local mode to avoid copying the jars manually. I will create a ticket for this.
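For example, something like the following (just a sketch; the Hadoop jar location is illustrative and depends on your installation) copies the jars into that directory:

import glob
import os
import shutil

import pyflink

# site-packages/pyflink/lib/ of the active Python environment
pyflink_lib = os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), "lib")

# Copy the Hadoop jars into the PyFlink lib directory (adjust the source path).
for jar in glob.glob("/path/to/hadoop/share/hadoop/*/*.jar"):
    shutil.copy(jar, pyflink_lib)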

3) How to support HADOOP_CLASSPATH in remote mode (e.g. submitted via `flink run`)
You could export HADOOP_CLASSPATH=`hadoop classpath` manually before executing `flink run`.

Regards,
Dian

On May 17, 2021 at 8:45 PM, Eduard Tudenhoefner <[hidden email]> wrote:



Re: Flink Python API and HADOOP_CLASSPATH

Eduard Tudenhoefner
Hi Dian,

Thanks a lot for the explanation and help. Option 2) is what I needed, and it works.

Regards
Eduard

On Tue, May 18, 2021 at 6:21 AM Dian Fu <[hidden email]> wrote: