start-cluster.sh not working in HA mode


Marchant, Hayden
I am attempting to run Flink 1.3.2 in HA mode with ZooKeeper.

When I run start-cluster.sh, the job manager is not started, even though the task manager is. When I dug into this, I saw that the command:

ssh -n $FLINK_SSH_OPTS $master -- "nohup /bin/bash -l \"${FLINK_BIN_DIR}/jobmanager.sh\" start cluster ${master} ${webuiport} &"

does not actually run anything on the host, i.e. I do not see "Starting jobmanager daemon on host ....."

Only when I remove ALL the quotes does it work, i.e. if I run:

ssh -n $FLINK_SSH_OPTS $master -- nohup /bin/bash -l ${FLINK_BIN_DIR}/jobmanager.sh start cluster ${master} ${webuiport} &

the job manager does start, and I see "Starting jobmanager daemon on host.....".
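
To make the difference between the two invocations concrete, here is a minimal local sketch that does not need ssh. The helper `show_args` and the sample values for `FLINK_BIN_DIR`, `master` and `webuiport` are hypothetical and only stand in for the real ones. It shows what the first program on the command line (ssh, in the commands above) receives in each case; it is not a diagnosis of why the quoted form fails on this particular host.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for ssh: prints how many arguments it received
# and each argument on its own line.
show_args() {
  echo "argc=$#"
  local a
  for a in "$@"; do
    echo "arg: [$a]"
  done
}

# Sample values, assumed for illustration only.
FLINK_BIN_DIR="/opt/flink/bin"
master="host1"
webuiport="8081"

# Quoted form: the whole command, including the trailing '&', is ONE
# argument. With ssh, that string is handed to the remote login shell,
# which does the parsing and backgrounds the command remotely.
quoted=$(show_args "nohup /bin/bash -l \"${FLINK_BIN_DIR}/jobmanager.sh\" start cluster ${master} ${webuiport} &")

# Unquoted form: the local shell splits the command into separate words.
# A trailing '&' here would be consumed by the LOCAL shell and would
# background the ssh client itself, so it is left out of this demo.
unquoted=$(show_args nohup /bin/bash -l ${FLINK_BIN_DIR}/jobmanager.sh start cluster ${master} ${webuiport})

echo "--- quoted ---"
echo "$quoted"
echo "--- unquoted ---"
echo "$unquoted"
```

In the quoted case the remote shell has to parse the string itself, so the result depends on which login shell the remote account uses and how it handles `nohup ... &` in a non-interactive session.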

Has anyone else experienced a similar problem? Are there any elegant workarounds that do not require changing the source code?

Thanks,
Hayden Marchant


Re: start-cluster.sh not working in HA mode

Fabian Hueske-2
Hi Hayden,

I tried to reproduce the problem you described by following the HA setup instructions in the documentation [1].
The instructions worked for me: start-cluster.sh started two JobManagers on my local machine (the masters file contained two localhost entries).

The bash scripts tend to be a bit fragile, especially when it comes to handling quotes and variables that contain spaces.
What kind of environment are you running on (I'm on macOS), and are you trying to start the JMs on localhost or on remote machines?
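
As a small illustration of the fragility mentioned above, the following sketch shows how an unquoted variable containing a space is split into several words by bash. The path and the `count_args` helper are hypothetical and not taken from the Flink scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the number of arguments it was called with.
count_args() {
  echo "$#"
}

# Hypothetical installation path that contains a space.
dir="/opt/my flink/bin"

# Unquoted expansion: the shell splits the value at the space.
unquoted_count=$(count_args $dir/jobmanager.sh)

# Quoted expansion: the value is passed as a single word.
quoted_count=$(count_args "$dir/jobmanager.sh")

echo "unquoted: $unquoted_count word(s), quoted: $quoted_count word(s)"
```

Checking whether FLINK_BIN_DIR or the host names in the masters file contain spaces or unusual characters is one quick way to rule this class of problem in or out.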

Best, Fabian

In reply to the message from Marchant, Hayden <[hidden email]>, 2017-10-16 11:53 GMT+02:00.