I started to answer these questions and then realized I was making an assumption about your environment. Do you have a reliable, persistent file system such as HDFS or S3 at your disposal, or do you truly mean to run on a single node?
If you truly intend to run on a single node only, there is no way to guarantee reliability: you would be exposed to machine and disk failures, etc.
I think the minimal reasonable production setup must use at least 3 physical nodes with the following services running:
1) HDFS or some other reliable filesystem (for persistent state storage)
2) Zookeeper for the Flink HA JobManager setup
The rest is configuration.
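For concreteness, here is a minimal sketch of the relevant flink-conf.yaml entries, using the key names from the 1.0.x HA documentation; the hostnames and paths are placeholders for your own cluster:

    # flink-conf.yaml (hostnames and paths are examples)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: node1:2181,node2:2181,node3:2181
    recovery.zookeeper.storageDir: hdfs:///flink/recovery
    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs:///flink/checkpoints

    # conf/masters -- list at least one standby JobManager
    node1:8081
    node2:8081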
With regard to scaling up after your initial deployment: right now, in the latest Flink release (1.0.3), you cannot stop and restart a job with a different parallelism without losing your computed state. This means that if you expect to scale up later and don't want to lose that state, you can over-provision the TaskManagers you do run with many extra slots and submit your job now with the maximum parallelism you expect to ever need. This will all become much simpler in future Flink versions (though not in 1.1), but for now it is a decent approach.
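To illustrate the over-provisioning idea, here is a rough sketch in Java; the parallelism of 32 is a made-up number, and the point is only that the job is submitted up front with the maximum parallelism you expect to ever need:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class OverProvisionedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Fix the parallelism at the level you expect to scale to,
            // since it cannot be changed later without losing state
            // (as of 1.0.3).
            env.setParallelism(32);

            // Trivial placeholder pipeline; replace with your real job.
            env.fromElements(1, 2, 3).print();

            env.execute("over-provisioned job");
        }
    }

You'd then make sure the TaskManagers together expose at least that many slots, e.g. taskmanager.numberOfTaskSlots: 16 in flink-conf.yaml on each of two TaskManagers.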
In versions after 1.1, Flink will be able to scale parallelism up and down while preserving all previously computed state.
-Jamie