Mh. I tried out the code I posted yesterday and it worked immediately. The security settings of AWS are sometimes a bit complicated. I think there are access logs for S3 buckets; maybe they contain some more information. Maybe there are other users facing the same issue. Since the NativeS3FileSystem class is from Hadoop, I suspect the code is widely used, and you can probably find answers to the most common problems on Google.

On Tue, Oct 6, 2015 at 1:07 PM, KOSTIANTYN Kudriavtsev <[hidden email]> wrote:

Hi Robert,

thank you very much for your input! Have you tried that?

With org.apache.hadoop.fs.s3native.NativeS3FileSystem I moved forward, and now get a new exception:

Caused by: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/***.csv' - ResponseCode=403, ResponseMessage=Forbidden

It's really strange, as I gave full permissions to authenticated users and can fetch the target file with s3cmd or S3 Browser from the same PC... I realize this is not really a question for you, but perhaps you have faced the same issue.

Thanks in advance!
Kostia

Thank you,
Konstantin Kudryavtsev

On Mon, Oct 5, 2015 at 10:13 PM, Robert Metzger <[hidden email]> wrote:

Hi Kostia,

thank you for writing to the Flink mailing list. I actually started to try out our S3 file system support after I saw your question on StackOverflow [1].
I found that our S3 connector is very broken. I had to resolve two more issues with it before I was able to get the same exception you reported. Another Flink committer looked into the issue as well (and confirmed it), but there was no solution [2].
So for now, I would say we have to assume that our S3 connector is not working. I will start a separate discussion on the developer mailing list about removing our S3 connector.

The good news is that you can just use Hadoop's S3 file system implementation with Flink. I used this Flink program to verify that it works:

public class S3FileSystem {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment ee = ExecutionEnvironment.createLocalEnvironment();
    DataSet<String> myLines = ee.readTextFile("s3n://my-bucket-name/some-test-file.xml");
    myLines.print();
  }
}

Also, you need to make a Hadoop configuration file available to Flink. When running Flink locally in your IDE, just create a "core-site.xml" in the src/main/resources folder with the following content. If you are running on a cluster, re-use the existing core-site.xml file (i.e., edit it) and point to its directory using Flink's fs.hdfs.hadoopconf configuration option.

<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>putKeyHere</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>putSecretHere</value>
</property>
<property>
<name>fs.s3n.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
</configuration>

With these two things in place, you should be good to go.

On Mon, Oct 5, 2015 at 8:19 PM, Kostiantyn Kudriavtsev <[hidden email]> wrote:

Hi guys,
I'm trying to get Apache Flink 0.9.1 working on EMR, basically to read
data from S3. I tried the following path for the data,
s3://mybucket.s3.amazonaws.com/folder, but it throws the following
exception:
java.io.IOException: Cannot establish connection to Amazon S3:
com.amazonaws.services.s3.model.AmazonS3Exception: The request signature
we calculated does not match the signature you provided. Check your key
and signing method. (Service: Amazon S3; Status Code: 403;
I added the access and secret keys, so the problem is not there. I'm using
the standard region and granted read permission to everyone.
Any ideas how this can be fixed?
Thank you in advance,
Kostia
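[Editorial note] One thing worth double-checking in the path above: Hadoop's s3n/s3 file systems take the bucket name from the host component of the URI, so a path like s3://mybucket.s3.amazonaws.com/folder would be interpreted as a bucket literally named "mybucket.s3.amazonaws.com", which could plausibly lead to exactly this kind of signature/403 failure. A minimal sketch (the bucket and key names here are made-up examples) showing how such a URI decomposes:

```java
import java.net.URI;

public class S3PathCheck {
    public static void main(String[] args) {
        // Made-up example paths for illustration
        URI plain = URI.create("s3n://my-bucket/folder/file.csv");
        URI withEndpoint = URI.create("s3n://mybucket.s3.amazonaws.com/folder");

        // The host component of the URI is what the file system treats as the bucket name
        System.out.println(plain.getHost());        // my-bucket
        System.out.println(plain.getPath());        // /folder/file.csv

        // Here the endpoint hostname ends up inside the "bucket name"
        System.out.println(withEndpoint.getHost()); // mybucket.s3.amazonaws.com
    }
}
```

So the path to try would be s3n://mybucket/folder, with the endpoint left out of the URI entirely.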