StreamingFileSink on EMR


StreamingFileSink on EMR

kb

When running Flink 1.7 on EMR 5.21 with StreamingFileSink, we see `java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer`. EMR reports Hadoop version 2.8.5. Is anyone else seeing this issue?


Re: StreamingFileSink on EMR

Till Rohrmann
Hi Kevin,

Could you check what's on the class path of the Flink cluster? You should see it at the top of the jobmanager.log. It seems as if there is a Hadoop dependency with a lower version. Which Hadoop version is your Flink 1.7 built against? You should make sure that you use either the Hadoop-free version or a version built against Hadoop >= 2.7. I'm not sure which options EMR offers here.

Cheers,
Till

On Tue, Feb 26, 2019 at 12:23 AM Bohinski, Kevin (Contractor) <[hidden email]> wrote:

When running Flink 1.7 on EMR 5.21 using StreamingFileSink we see java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS and for Hadoop version 2.7 or newer. EMR is showing Hadoop version 2.8.5. Is anyone else seeing this issue?


Re: StreamingFileSink on EMR

kb
Hi Till,

The only potentially problematic entry I see on the class path is `/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.29.0.jar`. I double-checked my pom; the project is Hadoop-free. The JM log also shows `INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.8.5-amzn-1`.

Best,
Kevin



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: StreamingFileSink on EMR

Till Rohrmann
Hmm, good question. I've pulled in Kostas, who worked on the StreamingFileSink. He might be able to tell you more in case there is some special behaviour with respect to the Hadoop file systems.

Cheers,
Till

On Tue, Feb 26, 2019 at 3:29 PM kb <[hidden email]> wrote:
Hi Till,

The only potential issue in the path I see is
`/usr/share/aws/emr/emrfs/lib/emrfs-hadoop-assembly-2.29.0.jar`. I double
checked my pom, the project is Hadoop-free. The JM log also shows `INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint         -  Hadoop
version: 2.8.5-amzn-1`.

Best,
Kevin




Re: StreamingFileSink on EMR

Kostas Kloudas-2
Hi Kevin,

I cannot find anything obviously wrong from what you describe. 
Just to eliminate the obvious, you are specifying "hdfs" as the scheme for your file path, right?

Cheers,
Kostas

On Tue, Feb 26, 2019 at 3:35 PM Till Rohrmann <[hidden email]> wrote:
Hmm good question, I've pulled in Kostas who worked on the StreamingFileSink. He might be able to tell you more in case that there is some special behaviour wrt the Hadoop file systems.

Cheers,
Till


Re: StreamingFileSink on EMR

elmosca
Hi,

I am having the same issue, and it is related to what Kostas is pointing out: I was trying to stream to the "s3" scheme rather than "hdfs", and got that exception.
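
The exception text suggests that the Hadoop-based recoverable writer requires both an "hdfs" scheme and Hadoop 2.7 or newer, which would explain why an "s3" path fails even on Hadoop 2.8.5. Below is a minimal Scala sketch of that condition. It is inferred from the message alone and is not Flink's actual code; `supportsHadoopRecoverableWriter` and the example paths are hypothetical.

```scala
import java.net.URI

object RecoverableWriterCheckSketch {
  // Hypothetical stand-in for the condition described by the exception message:
  // the scheme must be "hdfs" AND the Hadoop version must be 2.7 or newer.
  def supportsHadoopRecoverableWriter(path: String, hadoopMajor: Int, hadoopMinor: Int): Boolean = {
    val scheme = Option(new URI(path).getScheme).getOrElse("")
    val versionOk = hadoopMajor > 2 || (hadoopMajor == 2 && hadoopMinor >= 7)
    scheme.equalsIgnoreCase("hdfs") && versionOk
  }

  def main(args: Array[String]): Unit = {
    // Hadoop 2.8 with an hdfs path satisfies both conditions.
    println(supportsHadoopRecoverableWriter("hdfs://namenode:8020/out", 2, 8)) // prints "true"
    // Same Hadoop version, but the s3 scheme fails the first condition.
    println(supportsHadoopRecoverableWriter("s3://my-bucket/out", 2, 8)) // prints "false"
  }
}
```

So a sufficiently new Hadoop version is not enough on its own; a non-HDFS path needs a file system implementation that brings its own recoverable writer.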

I have realised that somehow I need to reach the S3RecoverableWriter, and found out it is in a different library, "flink-s3-fs-hadoop". Still trying to figure out how to make it work, though. I am aiming for code such as:

  val sink = StreamingFileSink
      .forBulkFormat(new Path("s3://...."), ...)
      .build()
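
For comparison, here is a fuller sketch of such a job against the Flink 1.7 Scala API. It swaps in the row format (`forRowFormat` with `SimpleStringEncoder`) instead of the bulk format above, so that no writer factory has to be assumed. The bucket name, job name, and checkpoint interval are hypothetical, and it assumes an S3 file system implementation such as flink-s3-fs-hadoop is available to the cluster.

```scala
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala._

object S3SinkSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // StreamingFileSink finalizes part files on checkpoints,
    // so checkpointing has to be enabled (interval is an arbitrary choice).
    env.enableCheckpointing(60000L)

    // Hypothetical bucket and prefix; replace with your own location.
    val sink: StreamingFileSink[String] = StreamingFileSink
      .forRowFormat(new Path("s3://my-bucket/output"), new SimpleStringEncoder[String]("UTF-8"))
      .build()

    // Placeholder source, only here to make the sketch complete.
    env.fromElements("a", "b", "c").addSink(sink)

    env.execute("streaming-file-sink-s3-sketch")
  }
}
```

With a short bounded source like this one, the job may finish before any checkpoint completes, so a real test would need a long-running source.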

Cheers,

Bruno

On Tue, 26 Feb 2019 at 14:59, Kostas Kloudas <[hidden email]> wrote:
Hi Kevin,

I cannot find anything obviously wrong from what you describe. 
Just to eliminate the obvious, you are specifying "hdfs" as the scheme for your file path, right?

Cheers,
Kostas


Re: StreamingFileSink on EMR

kb
Hi Bruno,

Thanks for verifying. We are aiming for the same.

Best,
Kevin




Re: StreamingFileSink on EMR

elmosca
Hey,

Got it working. Basically, you need to copy the flink-s3-fs-hadoop-1.7.2.jar library from the /opt folder of the Flink distribution into /usr/lib/flink/lib. That did the trick for me.
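
The copy step can also be scripted. Below is a minimal sketch in Scala using only the JDK; `copyS3Jars` is a hypothetical helper, and both directories are passed as arguments because the exact location of the unpacked distribution is an assumption that depends on your setup.

```scala
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import scala.collection.JavaConverters._

object CopyS3FsJar {
  /** Copies every flink-s3-fs-hadoop-*.jar found in optDir into libDir
    * and returns the paths of the copied files. */
  def copyS3Jars(optDir: Path, libDir: Path): Seq[Path] = {
    val stream = Files.newDirectoryStream(optDir, "flink-s3-fs-hadoop-*.jar")
    try {
      stream.asScala.toList.map { jar =>
        // Overwrite an existing jar of the same name, if any.
        Files.copy(jar, libDir.resolve(jar.getFileName), StandardCopyOption.REPLACE_EXISTING)
      }
    } finally {
      stream.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // args(0): opt folder of the unpacked Flink distribution
    // args(1): Flink lib folder on the cluster, e.g. /usr/lib/flink/lib
    val copied = copyS3Jars(Paths.get(args(0)), Paths.get(args(1)))
    copied.foreach(p => println(s"copied $p"))
  }
}
```

Flink presumably needs a restart afterwards so that the new jar is picked up from the lib folder.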

Cheers,

Bruno

On Tue, 26 Feb 2019 at 16:28, kb <[hidden email]> wrote:
Hi Bruno,

Thanks for verifying. We are aiming for the same.

Best,
Kevin




Re: StreamingFileSink on EMR

kb
Hi,

So the 1.7.2 jar has the fix?

Thanks
Kevin




Re: StreamingFileSink on EMR

elmosca
Hi,

That jar should exist in all the 1.7 versions; I was simply replacing the libs of the Flink provided by AWS EMR (1.7.0) with more recent ones. You could instead download the 1.7.0 distribution and copy flink-s3-fs-hadoop-1.7.0.jar from there into the /usr/lib/flink/lib folder.

But knowing there is a more recent 1.7 release out there, I prefer replacing the one on EMR with it. To do so, we basically replace the libs in the /usr/lib/flink/lib folder with the ones from the most recent distribution.

Cheers,

Bruno

On Tue, 26 Feb 2019 at 21:37, kb <[hidden email]> wrote:
Hi,

So 1.7.2 jar has the fix?

Thanks
Kevin




Re: StreamingFileSink on EMR

kb