Intermittent issue with GCS storage

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Intermittent issue with GCS storage

Heath Albritton
Howdy folks,

I'm attempting to get Flink running in a Kubernetes cluster with the
ultimate goal of using GCS for checkpoints and savepoints.  I've used
the helm chart to deploy and followed this guide, modified for 1.6.0:

https://data-artisans.com/blog/getting-started-with-da-platform-on-google-kubernetes-engine

I've built a container putting these:
flink-shaded-hadoop2-uber-1.6.0.jar
gcs-connector-hadoop2-latest.jar
in /opt/flink/lib

I've been running WordCount.jar to test, using the input and output
flags pointing at a GCS bucket.  I've verified that my two jars show
up in the classpath in the logs, but when I run the job it throws the
following errors:

flink-flink-jobmanager-9766f9b4c-kfkk5: Caused by:
org.apache.flink.core.fs.UnsupportedFileSystemSchemeException: Could
not find a file system implementation for scheme 'gs'. The scheme is
not directly supported by Flink and no Hadoop file system to support
this scheme could be loaded.
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:403)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.api.common.io.FileOutputFormat.initializeGlobal(FileOutputFormat.java:275)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.runtime.jobgraph.OutputFormatVertex.initializeOnMaster(OutputFormatVertex.java:89)
flink-flink-jobmanager-9766f9b4c-kfkk5: at
org.apache.flink.runtime.executiongraph.ExecutionGraphBuilder.buildGraph(ExecutionGraphBuilder.java:216)

I've been wrestling with this a fair bit.  Eventually I built a
container with the core-site.xml and the GCS key in the
/opt/flink/etc-hadoop directory and then set HADOOP_CONF_DIR to point
there.

I've discovered that I can run the container in standalone mode using
the start-cluster.sh script, it works just fine.  I can replicate this
in kubernetes and locally using docker as well as locally.

If I start the job manager and the task manager individually using
their respective scripts, I get the aforementioned error.  Oddly, I
get issues when running locally as well, if I use the start-cluster.sh
script, my wordcount test works just fine.  If I start the job manager
and task manager processes using their scripts, I can read the file
from GCS, but I get a 403 when trying to write the output.

I've no idea how to proceed with troubleshooting this further as I'm a
newbie to flink.  Some direction would be helpful.


Cheers,

Heath Albritton