Running a Python streaming job with Java dependencies

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Running a Python streaming job with Java dependencies

Joe Malt
Hi,

I'm trying to run a job with Flink's new Python streaming API but I'm running into issues with Java imports.

I have a Jython project in IntelliJ with a lot of Java dependencies configured through Maven. I can't figure out how to make Flink "see" these dependencies.

An example script that exhibits the problem is the following (it's the streaming example from the docs (https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/python.html#streaming-program-example) but with an extra import added)

from org.apache.flink.streaming.api.functions.source import SourceFunction
from org.apache.flink.api.common.functions import FlatMapFunction, ReduceFunction
from org.apache.flink.api.java.functions import KeySelector
from org.apache.flink.streaming.api.windowing.time.Time import milliseconds

# Added an extra import, this fails with an ImportError
import com.google.gson.GsonBuilder


class Generator(SourceFunction):
def __init__(self, num_iters):
self._running = True
self._num_iters = num_iters

# ... rest of the file is as in the documentation

This runs without any exceptions when run from IntelliJ (assuming com.google.gson is added in the POM), but when I try to run it as a Flink job with this command:

./pyflink-stream.sh ~/flink-python/MinimalExample.py - --local

it fails to find the dependency:

Starting execution of program
Failed to run plan: null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/var/folders/t1/gcltcjcn5zdgqfqrc32xk90x85xkg9/T/flink_streaming_plan_0bfab09c-baeb-414f-a718-01a5c71b3507/MinimalExample.py", line 7, in <module>
ImportError: No module named google

How can I point pyflink-stream.sh to these Maven dependencies? I've tried modifying the script to add my .m2/ directory to the classpath (using flink run -C), but that didn't make any difference:

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

. "$bin"/config.sh

"$FLINK_BIN_DIR"/flink run -C "file:///Users/jmalt/.m2/" --class org.apache.flink.streaming.python.api.PythonStreamBinder -v "$FLINK_ROOT_DIR"/opt/flink-streaming-python*.jar "$@"


Thanks,

Joe Malt

Engineering Intern, Stream Processing
Yelp
Reply | Threaded
Open this post in threaded view
|

Re: Running a Python streaming job with Java dependencies

Chesnay Schepler
To use java classes not bundled with FLink you will have to place a jar containing said classes into the /lib directory of the distribution.

On 25.07.2018 23:32, Joe Malt wrote:
Hi,

I'm trying to run a job with Flink's new Python streaming API but I'm running into issues with Java imports.

I have a Jython project in IntelliJ with a lot of Java dependencies configured through Maven. I can't figure out how to make Flink "see" these dependencies.

An example script that exhibits the problem is the following (it's the streaming example from the docs (https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/python.html#streaming-program-example) but with an extra import added)

from org.apache.flink.streaming.api.functions.source import SourceFunction
from org.apache.flink.api.common.functions import FlatMapFunction, ReduceFunction
from org.apache.flink.api.java.functions import KeySelector
from org.apache.flink.streaming.api.windowing.time.Time import milliseconds

# Added an extra import, this fails with an ImportError
import com.google.gson.GsonBuilder


class Generator(SourceFunction):
    def __init__(self, num_iters):
        self._running = True
        self._num_iters = num_iters
            
# ... rest of the file is as in the documentation

This runs without any exceptions when run from IntelliJ (assuming com.google.gson is added in the POM), but when I try to run it as a Flink job with this command:

./pyflink-stream.sh ~/flink-python/MinimalExample.py - --local

it fails to find the dependency:

Starting execution of program
Failed to run plan: null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/var/folders/t1/gcltcjcn5zdgqfqrc32xk90x85xkg9/T/flink_streaming_plan_0bfab09c-baeb-414f-a718-01a5c71b3507/MinimalExample.py", line 7, in <module>
ImportError: No module named google

How can I point pyflink-stream.sh to these Maven dependencies? I've tried modifying the script to add my .m2/ directory to the classpath (using flink run -C), but that didn't make any difference:

bin=`dirname "$0"`
bin=`cd "$bin"; pwd`

. "$bin"/config.sh

"$FLINK_BIN_DIR"/flink run -C "file:///Users/jmalt/.m2/" --class org.apache.flink.streaming.python.api.PythonStreamBinder -v "$FLINK_ROOT_DIR"/opt/flink-streaming-python*.jar "$@"


Thanks,

Joe Malt

Engineering Intern, Stream Processing
Yelp