Scaling a Flink cluster


Scaling a Flink cluster

Emmanuel
Hello,

In my understanding, the flink-conf.yaml is the one config file to configure a cluster.
The slaves file lists the slave nodes.
They must both be on every node.

I'm trying to understand the best strategy to scale a Flink cluster, since:
- adding a node means adding an entry to the slaves list and replicating the file on every node.

Does the cluster need to be restarted to take the new nodes into account? It seems like it.
Having to replicate the file on all nodes is not super convenient. Restarting is even more trouble.
Is there a way to scale a live cluster? If so how?
Any link to the info would be helpful.

Thanks

Re: Scaling a Flink cluster

Ufuk Celebi

On 16 Mar 2015, at 08:27, Emmanuel <[hidden email]> wrote:

> Hello,
>
> In my understanding, the flink-conf.yaml is the one config file to configure a cluster.
> The slaves file lists the slave nodes.
> They must both be on every node.

The slaves file is only used by the bin/start-cluster.sh startup script. The other configuration files (flink-conf.yaml, log4j.properties, etc.) need to be available on each worker node if you want to run a custom configuration, that's true.
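For reference, the slaves file is just a plain list of worker hostnames, one per line. A minimal sketch (the hostnames are placeholders, and the file is created here relative to the Flink install directory):

```shell
# Create a conf/slaves file (hostnames are placeholders --
# substitute your real worker hosts). One hostname per line.
mkdir -p conf
cat > conf/slaves <<'EOF'
worker1
worker2
EOF
# bin/start-cluster.sh ssh-es into each listed host and starts
# a TaskManager there; nothing else reads this file.
```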

The usual setup is to start the system from a shared directory, which is mounted on each node. If you don't have that in place, it would make sense to write a small script to sync the configuration across the nodes of your setup. How do you do it currently? You need to transfer the Flink files anyway, no?
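Such a sync script could be as small as a loop over the workers; a sketch, where the hostnames and install path are placeholders and the `echo` makes it a dry run (drop it to actually transfer):

```shell
#!/bin/sh
# Push the Flink config directory to every worker.
# WORKERS and FLINK_DIR are assumptions -- adjust for your setup.
WORKERS="worker1 worker2"
FLINK_DIR="/opt/flink"

sync_conf() {
  for host in $WORKERS; do
    # echo prints the command instead of running it (dry run);
    # remove it once the hostnames are real.
    echo rsync -az "$FLINK_DIR/conf/" "$host:$FLINK_DIR/conf/"
  done
}

sync_conf
```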

> Does the cluster need to be restarted to take the new nodes into account? It seems like it.
> Having to replicate the file on all nodes is not super convenient. Restarting is even more trouble.
> Is there a way to scale a live cluster? If so how?

Thanks for pointing this out. I think it's a good idea to add documentation for this.

You can add new worker nodes at runtime. You need to use the bin/taskmanager.sh script on the new worker node though:

path/to/bin/taskmanager.sh start &

The new worker will be available for all programs submitted after it has been registered with the master.
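The same script can also take a worker out of the cluster again; a sketch, with an illustrative install path:

```shell
# On the node joining the cluster (the path is a placeholder):
/opt/flink/bin/taskmanager.sh start

# To scale back down later, stop the TaskManager on that
# node with the same script:
/opt/flink/bin/taskmanager.sh stop
```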

– Ufuk

Re: Scaling a Flink cluster

Stephan Ewen
In reply to this post by Emmanuel

Hi Emmanuel!

The slaves file is not needed on every node. It is only used by the "start-cluster.sh" script, which makes an ssh call to every host in that file to start a TaskManager.

You can add a TaskManager to an existing Flink cluster by simply calling "taskmanager.sh start" on that machine (which should have a flink-conf.yaml file). The flink-conf.yaml may actually differ for every TaskManager as well, but that is a less common setup...

Greetings,
Stephan


RE: Scaling a Flink cluster

Emmanuel
In reply to this post by Emmanuel
I see...
Because of the start-cluster script, I was under the impression that the JobManager had to connect to each node upon start-up, which would make scaling an issue without restarting the JobManager, but it makes sense now. Thanks for the clarification.

Side question: what happens if the JobManager fails? Do the TaskManagers keep running the jobs? Should a typical production setup include multiple JobManagers for replication, and if so, how is that configured?

Emmanuel




Re: Scaling a Flink cluster

Stephan Ewen
Hi Emmanuel!

Flink does not yet include JobManager failover, but we have this on the list for the mid-term future (middle to second half of the year).

At this point, when the JobManager dies, the job is cancelled.

Greetings,
Stephan

