Hi,
I have a cluster of 4 dedicated machines (no VMs). My previous config was 1 master and 3 slaves; each machine ran either a TaskManager or the JobManager.

Now I want to reduce my cluster and have 1 master and 3 slaves, but with one machine running a JobManager and a TaskManager in parallel. I changed all conf/slaves files. When I start my cluster, everything looks fine for about 2 seconds -> one JM and 3 TMs with 8 cores/slots each. Two seconds later I see 4 TaskManagers and one JM. I can also run a job with 32 slots (4 TM * 8 slots) without any errors.

Why does my cluster have 4 TaskManagers?! All slaves files are cleaned up and contain 3 entries.

Thanks!

Marc
I start my cluster with:
bigdata@master:/usr/lib/flink-1.3.2$ ./bin/start-cluster.sh
Starting cluster.
Starting jobmanager daemon on host master.
Starting taskmanager daemon on host master.
Starting taskmanager daemon on host slave1.
Starting taskmanager daemon on host slave3.
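For reference, a conf/slaves file matching this startup output would simply list the three intended TaskManager hosts, one per line (a sketch using the hostnames from this thread):

master
slave1
slave3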
And if I stop it:
bigdata@master:/usr/lib/flink-1.3.2$ ./bin/stop-cluster.sh
Stopping taskmanager daemon (pid: 27050) on host master.
Stopping taskmanager daemon (pid: 2091) on host slave1.
Stopping taskmanager daemon (pid: 12684) on host slave3.
Stopping jobmanager daemon (pid: 26636) on host master.
My previous cluster additionally included slave5.
My current cluster does not include slave5, but the WebUI shows 4 TMs -> master, slave1, slave3 and slave5.
Hi Marc,
By chance did you edit the slaves file before shutting down the cluster? If so, then the removed worker would not be stopped and would reconnect to the restarted JobManager.

Greg
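A leftover TaskManager on the removed host can also be stopped by hand, for example (a sketch, assuming the same Flink install path on slave5):

bigdata@slave5:/usr/lib/flink-1.3.2$ ./bin/taskmanager.sh stop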
Hi Greg,
I guess I restarted the cluster too fast, combined with high CPU load inside the cluster. I tested it again a few minutes ago and there was no issue! With "$ jps" I checked whether there were any Java processes left -> there weren't.

But if the master doesn't know slave5, how can slave5 reconnect to the JobManager? That would mean the JobManager "adopts a child".

Marc
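A quick check for a leftover TaskManager process on the removed host could look like this (a sketch; the pid is a placeholder):

bigdata@slave5:~$ jps
12345 TaskManager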
Hi Marc,
the master, i.e. the JobManager, does not need to know which clients, i.e. TaskManagers, are supposed to connect to it. Indeed, only the task managers need to know where to connect, and they will try to establish that connection and re-connect when losing it.

Nico
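The address the task managers connect to comes from each machine's conf/flink-conf.yaml rather than from the slaves file; a minimal excerpt (values assumed from this setup, 6123 being the default RPC port):

jobmanager.rpc.address: master
jobmanager.rpc.port: 6123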
The scripts and the masters/slaves files are only relevant to the scripts that SSH to the machines to start/stop the processes. They have no real impact on how the processes find each other. Calling the scripts repeatedly while editing these files can start additional processes, or fail to stop all of them. In that case, you can repeatedly call stop-cluster.sh to stop the remaining processes, or SSH to the nodes and kill the processes manually.

Also: the files are only relevant on the machine where you execute the shell scripts. Editing them on other machines has no effect.
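A possible cleanup sequence along these lines (the pid and hostname are placeholders):

bigdata@master:/usr/lib/flink-1.3.2$ ./bin/stop-cluster.sh   # repeat until no daemons are reported
bigdata@master:/usr/lib/flink-1.3.2$ ssh slave5 jps          # look for a leftover TaskManager process
bigdata@master:/usr/lib/flink-1.3.2$ ssh slave5 kill 12345   # kill it by its pid if one remains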