I'm assuming I have a simple, common setup problem. I've spent 6 hours debugging and haven't been able to figure it out. Any help would be greatly appreciated.
Problem
I have a Flink Streaming job that writes SequenceFiles to S3. When I try to create a Flink Batch job to read these SequenceFiles back, I get the following error:
NoClassDefFoundError: org/apache/hadoop/mapred/FileInputFormat
It fails on this call to readSequenceFile:
env.createInput(HadoopInputs.readSequenceFile(Text.class, ByteWritable.class, INPUT_FILE))
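In case the surrounding setup matters, here is the whole batch job boiled down to a minimal sketch (the bucket path is a placeholder, and the imports are what my IDE resolves the classes to):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.Text;

public class ReadSequenceFileJob {

    // Placeholder path; the real job points at the bucket the streaming job writes to.
    private static final String INPUT_FILE = "s3://my-bucket/sequence-files/";

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // readSequenceFile builds a HadoopInputFormat<Text, ByteWritable>;
        // createInput turns it into a DataSet of key/value tuples.
        DataSet<Tuple2<Text, ByteWritable>> input =
                env.createInput(HadoopInputs.readSequenceFile(Text.class, ByteWritable.class, INPUT_FILE));

        input.first(10).print();
    }
}
```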
If I add a direct dependency on org.apache.hadoop:hadoop-mapred when building the job, I get the following error when trying to run it:
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3332)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:209)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:48)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:254)
at org.apache.flink.api.java.hadoop.mapred.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:150)
at org.apache.flink.api.java.hadoop.mapred.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:58)
at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:257)
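For reference, the relevant dependencies in my pom look roughly like this (the versions and the exact Hadoop artifact are my best recollection, so treat this as a sketch rather than my exact build file):

```xml
<!-- Provides HadoopInputs / HadoopInputFormatBase -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-compatibility_2.11</artifactId>
  <version>${flink.version}</version>
</dependency>
<!-- The direct Hadoop dependency mentioned above -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>
```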
Extra context
I've looked at the instructions for Hadoop Integration. I'm assuming my configuration is wrong, and that the Hadoop dependency needs to be set up in the jobmanager and taskmanager rather than in the job jar itself.
Questions
- Are there any existing projects that read batch Hadoop file formats from S3?
- If I use this Helm chart, do I need to download a hadoop-common jar into the Flink images for the jobmanager and taskmanager?
- Are there pre-built images I can use that already have the dependencies set up?
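My current working theory, in case it helps: Hadoop's FileSystem only resolves a scheme like "s3" if an implementation is registered for it. I was planning to try something like this in core-site.xml (the property name follows Hadoop's fs.<scheme>.impl convention, and the class comes from the hadoop-aws module, as far as I understand it):

```xml
<configuration>
  <!-- Map the bare "s3" scheme to the S3A implementation from hadoop-aws -->
  <property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
</configuration>
```

I'm not sure whether this file belongs in the job jar, in the Flink images, or both, which is part of what I'm asking above.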
- Dan