Modifying start-cluster scripts to efficiently spawn multiple TMs


Saliya Ekanayake
Hi,

The current start/stop scripts open a separate SSH connection to a worker node for each time it appears in the slaves file. When spawning multiple TMs (e.g., 24 per node), this is very inefficient.

I've changed the scripts to do one SSH per node and then spawn a given number N of TMs over that connection. I can make a pull request if this seems useful to others. For now, I assume the slaves file indicates the number of TMs per slave in "IP N" format.
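
For reference, a slaves file in this format might look like this (hostnames are just examples; a plain hostname with no count keeps the old behaviour of one TM):

  j-020 24
  j-021 24
  j-022

and the startup loop is roughly this shape (a simplified sketch, not the exact diff; FLINK_BIN_DIR and FLINK_CONF_DIR as set up by Flink's config.sh):

  # One SSH session per host; spawn N TaskManagers inside it.
  # -n keeps ssh from consuming the loop's stdin (the slaves file itself).
  while read -r HOST N; do
    N=${N:-1}    # a plain "IP" line keeps the old behaviour: one TM
    ssh -n "$HOST" -- "for i in \$(seq $N); do \"$FLINK_BIN_DIR/taskmanager.sh\" start; done" &
  done < "$FLINK_CONF_DIR/slaves"
  wait    # wait for all hosts to finish starting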

Thank you,
Saliya

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Gyula Fóra
Hi,

I think this would be a nice addition, especially for Flink clusters running on big machines where you might want to run multiple TaskManagers just to split the memory between multiple Java processes.

In any case, the previous config format should also be supported as the default.

I am curious what other developers/users think about this.

Cheers,
Gyula

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
Thank you. Yes, the previous format is still supported; this change kicks in only if a number is specified after the hostname.

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Greg Hogan
Hi Saliya,

Would you happen to have pdsh (parallel distributed shell) installed? If so, the TaskManager startup in start-cluster.sh will run in parallel.

As to running 24 TaskManagers together, are these running across multiple NUMA nodes? I filed FLINK-3163 (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I have seen that even with only two NUMA nodes, performance is improved by binding TaskManagers, both memory and CPU. I think we can improve the configuration of task slots as we do with memory, where the latter can be a fixed measure or a fraction of total memory.
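
For reference, this kind of binding can be done by hand with numactl (a sketch under the assumption of one TaskManager per NUMA node; numactl must be installed, and FLINK_BIN_DIR is as set by config.sh):

  # Start one TaskManager per NUMA node, binding both its CPU and memory there.
  NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
  for node in $(seq 0 $((NODES - 1))); do
    numactl --cpunodebind=$node --membind=$node "$FLINK_BIN_DIR/taskmanager.sh" start
  done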

Greg

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
pdsh is available on the head node only, but when I ran start-cluster from the head node (note the JobManager node is not the head node) it didn't work, which is why I modified the scripts.

Yes, exactly, this is what I was trying to do. My research area has been these NUMA-related issues, and binding a process to a socket (CPU) and its threads to individual cores has shown great advantage. I actually have Java code that binds processes and threads automatically (user configurable as well). For Flink, I've done this manually using a shell script that scans the TMs on a node and pins them appropriately. This approach is OK, but it would be better if the support were integrated into Flink.
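
The pinning script is roughly this shape (a sketch, not my exact script; assumes jps, numactl, and taskset are available):

  # Scan the TaskManager JVMs on this node and pin each one to a NUMA node,
  # round-robin. Note taskset only moves CPU affinity; memory the JVM has
  # already allocated stays where it is.
  NODES=$(numactl --hardware | awk '/^available:/ {print $2}')
  i=0
  for pid in $(jps | awk '/TaskManager$/ {print $1}'); do
    node=$((i % NODES)); i=$((i + 1))
    cpus=$(numactl --hardware | sed -n "s/^node $node cpus: //p" | tr ' ' ',')
    taskset -pc "$cpus" "$pid"
  done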

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Greg Hogan
I'd definitely be interested to hear any insight into what failed when starting the TaskManagers with pdsh. Did the command fail, fall back to standard ssh, or hit a parse error on the slaves file?

I'm wondering if we need to quote
  PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS
as
  PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
I am running some jobs now. I'll stop and restart using pdsh to see what the issue was.

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
I meant, I'll check once the current jobs are done and will let you know.

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
Looking at what happens with pdsh, there are two things that go wrong.

1. pdsh is installed on a node other than where the JobManager would run, so invoking start-cluster from there does not spawn a JobManager. Only if I run start-cluster from the node I specify as the JobManager's node does it get created.

2. If the slaves file has the same IP more than once, the following error occurs when the scripts try to rotate the log files. For example, I had node j-020 specified twice in my slaves file:

  j-020: mv: cannot move `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log' to `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.log.1': No such file or directory
  j-020: mv: cannot move `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out' to `/N/u/sekanaya/sali/software/flink-1.0.3/log/flink-sekanaya-taskmanager-26-j-020.out.1': No such file or directory


Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Greg Hogan
pdsh is only used for starting TaskManagers. How did you work around this? Are you able to SSH password-less to the JobManager?

The error looks to be from config.sh:318 in rotateLogFile. The way we generate the taskmanager index assumes that taskmanagers are started sequentially (flink-daemon.sh:108).
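
For reference, the index derivation is roughly this shape (paraphrased, not the exact source):

  # flink-daemon.sh: pick this daemon's per-host index by counting the pids
  # already recorded in the pid file. Two TaskManagers starting concurrently
  # on one host can read the same count, pick the same index, and then
  # rotateLogFile races over the same flink-<user>-taskmanager-<index>-<host>.log.
  if [ -f "$pid" ]; then
    id=$(wc -l < "$pid")
  else
    id=0
  fi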

Re: Modifying start-cluster scripts to efficiently spawn multiple TMs

Saliya Ekanayake
Yes, I have password-less SSH to the JobManager node.
