Re: Flink + S3
Posted by Till Rohrmann
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Flink-S3-tp6190p6192.html
Hi Michael-Keith,
you can use S3 as the checkpoint directory for the filesystem state backend. This means that whenever a checkpoint is performed, the state data will be written to that directory.
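As a rough sketch, the corresponding entries in flink-conf.yaml might look like this (the exact key names depend on your Flink version, and the bucket/path are placeholders):

```yaml
# Use the filesystem state backend and point its checkpoint
# directory at an S3 bucket (bucket and path are placeholders).
state.backend: filesystem
state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoints
```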
The same holds true for the ZooKeeper recovery storage directory. This directory will contain the submitted but not yet finished jobs as well as some metadata for the checkpoints. With this information it is possible to restore running jobs if the JobManager dies.
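A minimal sketch of the corresponding recovery settings in flink-conf.yaml, again with version-dependent key names and placeholder values:

```yaml
# ZooKeeper-based recovery; quorum address and S3 path are placeholders.
recovery.mode: zookeeper
recovery.zookeeper.quorum: zk-host:2181
recovery.zookeeper.storageDir: s3://my-bucket/flink/recovery
```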
As far as I know, Flink relies on Hadoop's file system wrapper classes to support S3. Flink has built-in support for HDFS, MapRFS and the local file system; for everything else, it tries to find a Hadoop class. Therefore, I fear that you need at least Hadoop's S3 file system class on your classpath, plus a core-site.xml or hdfs-site.xml stored at the location specified by fs.hdfs.hdfsdefault in Flink's configuration. In one of these files you then have to add an XML entry specifying the S3 file system class. But the easiest way would be to simply install Hadoop.
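Roughly, such a core-site.xml entry might look like the following. Hadoop's NativeS3FileSystem is one such implementation; the credentials shown are placeholders, and you would also point fs.hdfs.hdfsdefault in flink-conf.yaml at this file:

```xml
<configuration>
  <!-- Map the s3n:// scheme to Hadoop's native S3 file system. -->
  <property>
    <name>fs.s3n.impl</name>
    <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
  </property>
  <!-- Placeholder credentials; substitute your own AWS keys. -->
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```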
I'm not aware of any Puppet scripts, but I might be missing something here. If you do complete a Puppet script, it would definitely be a valuable addition to Flink :-)
Cheers,
Till