I'm trying to set up a 3-node Flink cluster (version 1.9) on the following machines:
Node 1 (Master): 4 GB (3.8 GB usable), Core2 Duo 2.80 GHz, Ubuntu 16.04 LTS
Node 2 (Slave): 16 GB, Core i7 3.40 GHz, Ubuntu 16.04 LTS
Node 3 (Slave): 16 GB, Core i7 3.40 GHz, Ubuntu 16.04 LTS
I have followed the instructions on: https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/cluster_setup.html
I have set "jobmanager.rpc.address" in conf/flink-conf.yaml in the following format: master@master-node1-hostname
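For reference, a minimal sketch of the relevant conf/flink-conf.yaml entries, assuming master-node1-hostname resolves to Node 1 on every machine (this key normally takes just a hostname or IP, without a user@ prefix; 6123 is the default RPC port):

    jobmanager.rpc.address: master-node1-hostname
    jobmanager.rpc.port: 6123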
The slaves are listed in conf/slaves as:
slave@slave-node2-hostname
slave@slave-node3-hostname
master@master-node1-hostname
(the master machine is used for task execution too)
However, when I run bin/start-cluster.sh on the master node, it fails to start the taskexecutor daemons on both slave nodes. It only starts a taskexecutor daemon and a standalonesession daemon on master@master-node1-hostname (Node 1).
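When the daemons fail to come up on the slaves, their logs usually say why. A sketch of where to look, assuming Flink is unpacked under ~/flink-1.9.0 on each slave (the exact log file name includes the local user and hostname):

    ssh slave@slave-node2-hostname
    ls ~/flink-1.9.0/log/
    tail -n 100 ~/flink-1.9.0/log/flink-*-taskexecutor-*.log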
I have tried both passwordless SSH and password SSH on all machines, but the result is the same. In the latter case, it does prompt for the passwords of slave@slave-node2-hostname and slave@slave-node3-hostname, but then fails to print any message like "Starting taskexecutor daemon on xxxx".
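For reference, passwordless SSH from the master to the workers can be set up roughly like this (a sketch, assuming OpenSSH; run on the master node):

    ssh-keygen -t rsa                        # accept the defaults, empty passphrase
    ssh-copy-id slave@slave-node2-hostname
    ssh-copy-id slave@slave-node3-hostname
    ssh slave@slave-node2-hostname           # should now log in without a password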
I switched my master node to Node 2 and made Node 1 a slave. It was able to start taskexecutor daemons on both Node 2 and Node 3 successfully, but did nothing for Node 1.
I'd appreciate any advice on what the problem could be and how I can resolve it.
Best Regards, Komal
|
I managed to fix it, but ran into another problem that I'd appreciate help in resolving. It turns out that the username was different on each of the three nodes; using the same username on all of them fixed the issue, i.e.:
same_username@slave-node2-hostname
same_username@slave-node3-hostname
same_username@master-node1-hostname
In fact, because the usernames are the same, I can just list them in the conf files as:
slave-node2-hostname
slave-node3-hostname
master-node1-hostname
However, for some reason my worker nodes don't show up as available task managers in the web UI.
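One way to see which TaskManagers have actually registered with the JobManager is to query its REST API (a sketch, assuming the web UI/REST endpoint is on the default port 8081):

    curl http://master-node1-hostname:8081/taskmanagers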
... (clipped for brevity)
2019-09-12 15:56:36,625 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
Best Regards, Komal
On Wed, 11 Sep 2019 at 14:13, Komal Mariam <[hidden email]> wrote:
|
Hi Komal,
Could you check that every node can reach the other nodes? It looks a little bit as if the TaskManager cannot talk to the JobManager running on 150.82.218.218:6123.
Cheers, Till
On Thu, Sep 12, 2019 at 9:30 AM Komal Mariam <[hidden email]> wrote:
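For example, something along these lines from each worker node (a sketch; 6123 is the default JobManager RPC port):

    ping -c 3 150.82.218.218
    nc -zv 150.82.218.218 6123   # succeeds only if the JobManager RPC port is reachable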
|
Hi Till,
Thank you for the reply. I tried to SSH into each of the nodes from every other node, and they can all connect to each other. It's just that, for some reason, the worker nodes cannot connect to the JobManager on 150.82.218.218:6123 (Node 1). I got around the problem by setting the master node (JobManager) on Node 2 and making 150.82.218.218 a slave (TaskManager). Now all nodes, including 150.82.218.218, show up in the new JobManager's UI and I can see my jobs getting distributed between them too.
For now, all my nodes have password-enabled SSH. Do you think this issue could be because I have not set up passwordless SSH? If start-cluster.sh can start the daemons with password SSH, why is it important to set up passwordless SSH (aside from convenience)?
Best Regards, Komal
On Fri, 13 Sep 2019 at 18:31, Till Rohrmann <[hidden email]> wrote:
|
SSH access to the nodes and the nodes being able to talk to each other are separate issues. The former is only used for starting the Flink cluster. Once the cluster is started, Flink only requires that the nodes can talk to each other (independent of SSH).
Cheers, Till
On Tue, Sep 17, 2019 at 7:39 AM Komal Mariam <[hidden email]> wrote:
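For reference, these are the kinds of flink-conf.yaml settings whose ports have to be reachable between the nodes once the cluster is running (a sketch; 6123 and 8081 are the defaults, while 6124 and 50100-50200 are only illustrative pins chosen to make firewall rules easier):

    jobmanager.rpc.port: 6123            # TaskManagers connect to the JobManager here
    rest.port: 8081                      # web UI / REST
    blob.server.port: 6124               # default is a random port; pinning it is optional
    taskmanager.rpc.port: 50100-50200    # default is a random port; a fixed range is optional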