Modifying start-cluster scripts to efficiently spawn multiple TMs


Saliya Ekanayake
Hi,

The current start/stop scripts open a separate SSH connection to a worker node for each time it appears in the slaves file. When spawning multiple TMs (e.g., 24 per node), this is very inefficient.

I've changed the scripts to do one SSH per node and then spawn a given number N of TMs over that connection. I can make a pull request if this seems useful to others. For now, I assume the slaves file indicates the number of TMs per slave in "IP N" format.
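
For reference, a slaves file in this format might look like this (hostnames are just examples; a plain hostname with no count keeps the old behaviour of one TM):

  j-020 24
  j-021 24
  j-022

and the startup loop is roughly this shape (a simplified sketch, not the exact diff; FLINK_BIN_DIR and FLINK_CONF_DIR as set up by Flink's config.sh):

  # One SSH session per host; spawn N TaskManagers inside it.
  # -n keeps ssh from consuming the loop's stdin (the slaves file itself).
  while read -r HOST N; do
    N=${N:-1}    # a plain "IP" line keeps the old behaviour: one TM
    ssh -n "$HOST" -- "for i in \$(seq $N); do \"$FLINK_BIN_DIR/taskmanager.sh\" start; done" &
  done < "$FLINK_CONF_DIR/slaves"
  wait    # wait for all hosts to finish starting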

Thank you,
Saliya

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Gyula Fóra
Hi,

I think this would be a nice addition, especially for Flink clusters running on big machines where you might want to run multiple TaskManagers just to split the memory between multiple Java processes.

In any case, the previous config format should also be supported as the default.

I am curious what other developers/users think about this.

Cheers,
Gyula

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
Thank you. Yes, the previous format is still supported; this change kicks in only if a number is specified after the hostname.

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Greg Hogan
Hi Saliya,

Would you happen to have pdsh (parallel distributed shell) installed? If so, the TaskManager startup in start-cluster.sh will run in parallel.

As to running 24 TaskManagers together, are these running across multiple NUMA nodes? I filed FLINK-3163 (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I have seen that even with only two NUMA nodes, performance is improved by binding TaskManagers, both memory and CPU. I think we can improve the configuration of task slots as we do with memory, where the latter can be a fixed measure or a fraction of total memory.
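
For reference, this kind of binding can be done by hand with numactl (a sketch under the assumption of one TaskManager per NUMA node; numactl must be installed, and FLINK_BIN_DIR is as set by config.sh):

  # Start one TaskManager per NUMA node, binding both its CPU and memory there.
  NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
  for node in $(seq 0 $((NODES - 1))); do
    numactl --cpunodebind=$node --membind=$node "$FLINK_BIN_DIR/taskmanager.sh" start
  done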

Greg

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
pdsh is available on the head node only, but when I ran start-cluster from the head node (note the JobManager node is not the head node) it didn't work, which is why I modified the scripts.

Yes, exactly, this is what I was trying to do. My research area has been these NUMA-related issues, and binding a process to a socket (CPU) and its threads to individual cores has shown great advantage. I actually have Java code that binds processes and threads automatically (user configurable as well). For Flink, I've done this manually using a shell script that scans the TMs on a node and pins them appropriately. This approach is OK, but it would be better if the support were integrated into Flink.
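
The pinning script is roughly this shape (a sketch, not my exact script; assumes jps, numactl, and taskset are available):

  # Scan the TaskManager JVMs on this node and pin each one to a NUMA node,
  # round-robin. Note taskset only moves CPU affinity; memory the JVM has
  # already allocated stays where it is.
  NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
  i=0
  for pid in $(jps | awk '/TaskManager$/ {print $1}'); do
    node=$((i % NODES)); i=$((i + 1))
    cpus=$(numactl --hardware | sed -n "s/^node $node cpus: //p" | tr ' ' ',')
    taskset -pc "$cpus" "$pid"
  done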

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Greg Hogan
I'd definitely be interested to hear any insight into what failed when starting the TaskManagers with pdsh. Did the command fail, fall back to standard ssh, or hit a parse error on the slaves file?

I'm wondering if we need to quote
  PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS
as
  PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
I am running some jobs now. I'll stop and restart using pdsh to see what the issue was.

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
I meant, I'll check once the current jobs are done and will let you know.

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
Looking at what happens with pdsh, there are two things that go wrong.

1. pdsh is installed on a node other than where the JobManager would run, so invoking start-cluster from there does not spawn a JobManager. Only if I run start-cluster from the node I specify as the JobManager's node does it get created.

2. If the slaves file has the same IP more than once, the following error occurs when the scripts try to rotate the log files. For example, I had node j-020 specified twice in my slaves file:

  j-020: mv: cannot move `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log' to `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log.1': No such file or directory
  j-020: mv: cannot move `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out' to `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out.1': No such file or directory


Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Greg Hogan
pdsh is only used for starting TaskManagers. How did you work around this? Are you able to SSH password-less to the JobManager?

The error looks to be from config.sh:318 in rotateLogFile. The way we generate the taskmanager index assumes that taskmanagers are started sequentially (flink-daemon.sh:108).
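
For reference, the index derivation is roughly this shape (paraphrased, not the exact source):

  # flink-daemon.sh: pick this daemon's per-host index by counting the pids
  # already recorded in the pid file. Two TaskManagers starting concurrently
  # on one host can read the same count, pick the same index, and then
  # rotateLogFile races over the same flink-<user>-taskmanager-<index>-<host>.log.
  if [ -f "$pid" ]; then
    id=$(wc -l < "$pid")
  else
    id=0
  fi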

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
Yes, I have password-less SSH to the JobManager node.
