Flink remote batch execution in dynamic cluster

Antonio Martínez Carratalá
Hello

I'm working on a project with Flink 1.8. I submit my jobs from Java code to a remote Flink cluster, as described at https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/cluster_execution.html . That part works, but I want to configure a dynamic Flink cluster to execute the jobs.

Imagine I have users who sometimes need to run a report generated from data processed in Flink. Whenever a user requests a report, I have to submit a job to a remote Flink cluster. The job is heavy and may take an hour to finish.

So I don't want to have 3, 4, 5... Task Managers always running in the cluster: sometimes they would sit idle, and at other times there would not be enough of them for all the requests. I want Task Managers to be created dynamically as jobs arrive at the Job Manager, and removed once the jobs finish.

I see many options for creating a cluster at https://ci.apache.org/projects/flink/flink-docs-release-1.8/ in section [Deployment & Operations] [Clusters & Deployment], such as Standalone, YARN, Mesos, Docker, Kubernetes... but I don't know which would be the most suitable for my use case. I'm not an expert in devops and I barely know these technologies.

Some advice on which technology to use, and maybe some examples, would be really appreciated.

Keep in mind that I need to run the job with ExecutionEnvironment.createRemoteEnvironment(); uploading a jar is not a valid option for me. It seems to me that not all the options support remote submission of jobs, but I'm not sure.

Thank you

Antonio Martinez


Re: Flink remote batch execution in dynamic cluster

Piotr Nowojski-3
Hi,

I guess it depends on what you already have available in your cluster; I would try to use that. Running Flink in an existing YARN cluster is very easy, but setting up a YARN cluster in the first place, even if it’s easy (I’m not sure whether that’s the case), would add extra complexity.

When I spawn an AWS cluster for testing, I use EMR with YARN included, and I think that’s very easy to do, as everything works out of the box. I’ve heard that Kubernetes/Docker are just as easy. I’m also not a devops person, but I’ve heard that my colleagues, if they have any preference, usually prefer Kubernetes.

> Keep in mind that I need to run the job with ExecutionEnvironment.createRemoteEnvironment(); uploading a jar is not a valid option for me. It seems to me that not all the options support remote submission of jobs, but I'm not sure.


I think all of them should support the remote environment. Standalone, YARN, Kubernetes and Docker almost certainly do.

Piotrek

On 28 Feb 2020, at 10:25, Antonio Martínez Carratalá <[hidden email]> wrote:

[Quoted original message trimmed]



Re: Flink remote batch execution in dynamic cluster

Antonio Martínez Carratalá
Thank you Piotrek, I will check those options. I only have a standalone cluster, so any option would need some setup.

On Fri, Feb 28, 2020 at 2:12 PM Piotr Nowojski <[hidden email]> wrote:
[Quoted messages trimmed]