Hello Flink Users!
I'm a Flink newbie at the early stages of deploying our first Flink cluster into production, and I have a few questions about wiring up Flink with S3:

* We are going to use the HA configuration[1] from day one (we have existing zk infrastructure already). Can S3 be used as a state backend for the Job Manager? The documentation talks about using S3 as a state backend for TM[2] (and in particular for streaming), but I'm wondering if it's a suitable backend for the JM as well.

* How do I configure S3 for Flink when I don't already have an existing Hadoop cluster? The documentation references the Hadoop configuration manifest[3], which kind of implies to me that I must already be running Hadoop (or at least have a properly configured Hadoop cluster). Is there an example somewhere of using S3 as a storage backend for a standalone cluster?

* Bonus: I'm writing a Puppet module for installing/configuring/managing Flink in standalone mode with an existing zk cluster. Are there any existing modules for this (I didn't find anything in the forge)? Would others in the community be interested if we added our module to the forge once complete?

Thanks so much for your time and consideration. We look forward to using Flink in production!

Cheers,
Michael-Keith

[1]: https://ci.apache.org/projects/flink/flink-docs-master/setup/jobmanager_high_availability.html#standalone-cluster-high-availability
[2]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#s3-simple-storage-service
[3]: https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#set-s3-filesystem
Hi Michael-Keith,

you can use S3 as the checkpoint directory for the filesystem state backend. This means that whenever a checkpoint is performed, the state data will be written to this directory.

The same holds true for the ZooKeeper recovery storage directory. This directory will contain the submitted but not yet finished jobs, as well as some metadata for the checkpoints. With this information it is possible to restore running jobs if the job manager dies.

As far as I know, Flink relies on Hadoop's file system wrapper classes to support S3. Flink has built-in support for hdfs, maprfs and the local file system. For everything else, Flink tries to find a Hadoop class. Therefore, I fear that you need at least Hadoop's S3 filesystem class on your classpath, plus a file called core-site.xml or hdfs-site.xml stored at the location specified by fs.hdfs.hdfsdefault in Flink's configuration. In one of these files you have to add the XML property that registers the S3 filesystem class. But the easiest way would be to simply install Hadoop.

I'm not aware of any Puppet scripts, but I might be missing something here. If you complete a Puppet script, it would definitely be a valuable addition to Flink :-)

Cheers,
Till
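P.S. For concreteness, here is a rough sketch of the two files involved. The bucket name, paths and credentials are made up, and you should double-check the exact keys against the docs for your Flink and Hadoop versions:

    <!-- core-site.xml: register Hadoop's S3 filesystem class and credentials -->
    <configuration>
      <property>
        <name>fs.s3.impl</name>
        <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
      </property>
      <property>
        <name>fs.s3.awsAccessKeyId</name>
        <value>YOUR_ACCESS_KEY</value>
      </property>
      <property>
        <name>fs.s3.awsSecretAccessKey</name>
        <value>YOUR_SECRET_KEY</value>
      </property>
    </configuration>

    # flink-conf.yaml: use S3 both for checkpoints and for the
    # ZooKeeper recovery storage directory (bucket name hypothetical)
    state.backend: filesystem
    state.backend.fs.checkpointdir: s3://my-flink-bucket/checkpoints
    recovery.zookeeper.storageDir: s3://my-flink-bucket/recovery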
Hey Michael-Keith,
are you running self-managed EC2 instances or EMR?

In addition to what Till said: we tried to document this here as well:
https://ci.apache.org/projects/flink/flink-docs-master/setup/aws.html#provide-s3-filesystem-dependency

Does this help? You don't really need to install Hadoop, but only provide the configuration and the S3 FileSystem code on your classpath. If you use EMR + Flink on YARN, it should work out of the box.

– Ufuk
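P.S. A sketch of what "provide the configuration and the S3 FileSystem code" can look like for a standalone setup. The install path and jar versions below are only examples; pick the artifacts matching your Hadoop release (see the docs page above):

    # put Hadoop's S3 filesystem code on Flink's classpath
    cp hadoop-aws-2.7.2.jar /opt/flink/lib/
    # plus the transitive deps that Flink shades away, e.g. Guava
    cp guava-11.0.2.jar /opt/flink/lib/

    # and in conf/flink-conf.yaml, point Flink at the directory
    # holding your core-site.xml:
    fs.hdfs.hadoopconf: /etc/hadoop/conf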
Hey Till & Ufuk,
We're running on self-managed EC2 instances (and we'll eventually have a mirror cluster in our colo). The provided documentation notes that for Hadoop 2.6, we'd need such-and-such version of hadoop-aws and guice on the CP. If I wanted to instead use Hadoop 2.7, which versions of those dependencies should I get? And how can I look that up myself? The pom file for hadoop-aws[1] doesn't mention a specific dependency on Guice, so I'm curious how the author of that documentation knew exactly which dependencies and versions were required.

Let me switch my questioning slightly: What is the best (most widely supported, most common, easiest to use, easiest to scale, etc.) way to deploy Flink today? I've been operating under the assumption that, since we have no existing Hadoop infrastructure, the path of least resistance is a standalone cluster. However, it seems like Flink is still relatively tightly coupled to the Hadoop platform, so maybe I would be better off switching to Hadoop + YARN? Our requirements are simple (for now): Kafka (consumer & producer), S3 (read & write), and streaming- and batch-mode computation.

If the answer turns out to be that YARN is the best path forward for us, do you have any recommendations on how to get started building a minimal but production-ready Hadoop cluster suitable for Flink? Ambari looks amazing, so barring feedback to the contrary I'll probably be investing time looking at that first. Finally, any relevant book recommendations? :)

I'm extremely excited about this project, so all the feedback I can get is highly welcome and highly appreciated!

Cheers,
Michael-Keith

P.S. Is there planned support for Mesos as an alternative scheduler to YARN?

[1]: http://central.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.2/hadoop-aws-2.7.2.pom
Hey Michael-Keith,

I think you meant Guava and not Guice. How to determine which dependencies you need is quite a mess at the moment. It depends on a combination of three things:

1) the dependencies of hadoop-aws [1],
2) which S3 file system you use (in the case of the docs, org.apache.hadoop.fs.s3native.NativeS3FileSystem) [2],
3) what Flink shades away in its Hadoop dependencies [3].

Regarding 1): hadoop-aws depends on hadoop-common (and other packages). hadoop-common is already part of Flink (including the fs.FileSystem classes etc.).

Regarding 2): NativeS3FileSystem uses dependencies from hadoop-common like FileSystem, and from hadoop-aws like Jets3tNativeFileSystemStore. The hadoop-common stuff is part of Flink, and Jets3tNativeFileSystemStore is part of hadoop-aws. The big issue here is that other S3 FS implementations might work with the aws-java-sdk packages of hadoop-aws.

Regarding 3): Flink shades Hadoop's Guava dependency away, and that's why you need to add it manually to the CP.

So, if you go for the suggested NativeS3FileSystem, you end up needing hadoop-aws and Guava. Of course, this might change in future versions of Flink and/or Hadoop. I will update the docs for the different versions of Flink and Hadoop for now and hope that this will help. :-(

The easiest solution in the future would be for Flink to come with hadoop-aws, but I don't think that is going to happen.

– Ufuk

[1] http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.6.0
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html#provide-s3-filesystem-dependency
[3] https://github.com/apache/flink/blob/master/flink-shaded-hadoop/pom.xml
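P.S. To the "how can I look that up myself" part: one way is to drop hadoop-aws into a throwaway Maven project and let Maven print its transitive closure, then cross-check against what Flink already ships and shades [3]. A sketch (the version number is just an example):

    <!-- pom.xml of a throwaway project, used only to inspect dependencies -->
    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>tmp</groupId>
      <artifactId>deps-check</artifactId>
      <version>0.1</version>
      <dependencies>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-aws</artifactId>
          <version>2.7.2</version> <!-- match your Hadoop version -->
        </dependency>
      </dependencies>
    </project>

Then, in that directory:

    mvn dependency:tree   # prints hadoop-aws plus everything it pulls in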
Hi Michael-Keith,
Welcome to the Flink community! Let me try to answer your question regarding the "best" deployment options: From what I see on the mailing list, most of our users are using one of the big Hadoop distributions (including Amazon EMR) with YARN installed. Having YARN makes things quite comfortable because it takes care of restarting JVMs in failure cases, deployment of the Flink jars, security (if used), and so on.

Flink might look tightly coupled to Hadoop because we ship it with different Hadoop versions, but that's mostly for convenience reasons. You don't need to install anything from the Hadoop project to run Flink. It's just that in the past almost all users were using Hadoop, so it was not an issue. I would not install YARN on the cluster just for running Flink.

How are you running Kafka in your cluster? I think you can run Flink in a very similar way. The only difficult part is probably the failure recovery: when a Flink JVM crashes, you want it to be restarted by some service on your server. I've seen users who were using OS tools like upstart to ensure the Flink TaskManagers are always running.

Regarding the Puppet module: The Apache Bigtop project (basically an open source Hadoop distro) is currently adding support for Flink: https://issues.apache.org/jira/browse/BIGTOP-1927. They'll create deb and rpm packages, Puppet scripts, and testing for Flink. Maybe they can use your Puppet code as a reference. Independent of their effort, I think it would be great if you published your Puppet module.

Regarding Mesos: There are plans to integrate Flink with Mesos. I don't think it'll make it into 1.1, but 1.2 seems realistic.

Regards,
Robert
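P.S. A minimal sketch of the upstart approach, assuming Flink lives under /opt/flink (a made-up path) and that your Flink version's taskmanager.sh can run in the foreground; if your version's script only daemonizes, you would exec the underlying java command instead:

    # /etc/init/flink-taskmanager.conf
    description "Flink TaskManager"
    start on runlevel [2345]
    stop on runlevel [!2345]
    respawn                # restart the JVM if it crashes
    exec /opt/flink/bin/taskmanager.sh start-foreground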