(DEPRECATED) Apache Flink User Mailing List archive.

Deployment Architecture for Flink Applications

Chakravarthy varaga

Deployment Architecture for Flink Applications

Hi Team,

We are analysing different deployment options for managing Flink Jobs on AWS EC2 instances.

Basically, the options (resource managers) in front of us are:

-> Standalone cluster

-> On YARN

-> Deploy using Mesos/Marathon

-> Deploy using Kubernetes/Docker

The resource manager options are a bit confusing, and we are unable to decide which one to go with. The criteria we are using as inputs to our analysis are:

-> Dynamic Scaling of resources

-> Resource Allocation

-> Jobs Scheduling

-> No-Downtime upgrades

-> Monitoring & Metrics.

Right now our plan is to do a paper-based study evaluating these options.

I'm sure many of you running Flink in production/support have encountered issues around these. Can someone point us to blogs, research papers, or other material focusing on the approach taken and the considerations for evaluation?

Any help here is highly appreciated!

Best Regards

CVP

Kostas Kloudas

Re: Deployment Architecture for Flink Applications

Hi CVP,

On how people use Flink, you can check this blog post to see how Alibaba does it:

http://data-artisans.com/blink-flink-alibaba-search/

In addition, you can find more information on the matter in the talks from the last Flink Forward conference: http://berlin.flink-forward.org/program/sessions/

For example Netflix also shares some information here:

http://berlin.flink-forward.org/kb_sessions/beaming-flink-to-the-cloud-netflix/

Now for how things work under the hood, I will provide links to the Flink documentation.

I hope that this will also help you figure out what fits your needs best:

For deployment and operations, the main resource is the Flink documentation,

https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cluster_setup.html

and for what is about to come on that front, you can check out the FLIP-6 page:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

To dynamically scale your Flink job, you have to take a savepoint and restart your job with a different parallelism.

You can find some details here: https://www.slideshare.net/tillrohrmann/dynamic-scaling-how-apache-flink-adapts-to-changing-workloads . Unfortunately, this talk is a little bit outdated. We will update our documentation on dynamic scaling soon.
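As a rough illustration of the savepoint-and-restart procedure, the sketch below only builds the two Flink CLI invocations as argument lists; it does not run them. The `cancel -s`, `run -s` and `-p` options are part of the Flink command-line client. The job ID, savepoint directory, savepoint path and jar name are hypothetical placeholders, and the helper names are not part of any Flink API. It assumes the client is invoked as `bin/flink` from the Flink installation directory.

```python
from typing import List


def cancel_with_savepoint_cmd(job_id: str, savepoint_dir: str,
                              flink_bin: str = "bin/flink") -> List[str]:
    """Build the command that takes a savepoint and then cancels the job.

    The client prints the path of the created savepoint; that path is
    needed for the restart step.
    """
    return [flink_bin, "cancel", "-s", savepoint_dir, job_id]


def resume_cmd(savepoint_path: str, jar_path: str, parallelism: int,
               flink_bin: str = "bin/flink") -> List[str]:
    """Build the command that restarts the job from a savepoint
    with a new default parallelism."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return [flink_bin, "run", "-s", savepoint_path,
            "-p", str(parallelism), jar_path]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(cancel_with_savepoint_cmd("<jobId>", "s3://my-bucket/savepoints")))
    print(" ".join(resume_cmd("<savepointPath>", "my-job.jar", 8)))
```

The resulting lists could be passed to `subprocess.run`. Between the two steps, the savepoint path printed by the first command has to be copied into the second one.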

For the Resource allocation and Job Scheduling, you can check the links I included for deployment and operations,

and also this: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/job_scheduling.html
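For a first resource-allocation estimate, a back-of-the-envelope calculation can help. The sketch below assumes a single job with Flink's default slot sharing, in which case the job needs as many task slots as its highest operator parallelism. The function name is hypothetical; `slots_per_task_manager` corresponds to the `taskmanager.numberOfTaskSlots` setting.

```python
import math


def task_managers_needed(max_operator_parallelism: int,
                         slots_per_task_manager: int) -> int:
    """Estimate how many TaskManagers one job needs.

    Assumes default slot sharing, so the number of required slots equals
    the highest parallelism of any operator in the job.
    """
    if max_operator_parallelism < 1 or slots_per_task_manager < 1:
        raise ValueError("both arguments must be at least 1")
    return math.ceil(max_operator_parallelism / slots_per_task_manager)


if __name__ == "__main__":
    # A job with parallelism 10 on TaskManagers offering 4 slots each.
    print(task_managers_needed(10, 4))
```

This ignores memory and CPU sizing per slot, which still has to be checked separately for each resource manager.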

For metrics and monitoring you can check here: https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/metrics.html

and the related pages in the Debugging and monitoring section of the Flink documentation.

I hope this can help as a first step,

Kostas

On Feb 22, 2017, at 12:30 PM, Chakravarthy varaga <[hidden email]> wrote:

Hi Team,

    We are analysing different deployment options for managing Flink Jobs on AWS EC2 instances.

    Basically, the options (resource managers) in front of us are:
    -> Standalone cluster
    -> On YARN
    -> Deploy using Mesos/Marathon
    -> Deploy using Kubernetes/Docker

    The resource manager options are a bit confusing, and we are unable to decide which one to go with. The criteria we are using as inputs to our analysis are:
    -> Dynamic Scaling of resources
    -> Resource Allocation
    -> Jobs Scheduling
    -> No-Downtime upgrades
    -> Monitoring & Metrics.

    Right now our plan is to do a paper-based study evaluating these options.

    I'm sure many of you running Flink in production/support have encountered issues around these. Can someone point us to blogs, research papers, or other material focusing on the approach taken and the considerations for evaluation?

    Any help here is highly appreciated!

Best Regards
CVP