Problem starting taskexecutor daemons in 3 node cluster

Posted by Komal Mariam on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Problem-starting-taskexecutor-daemons-in-3-node-cluster-tp29908.html

I'm trying to set up a 3 node Flink cluster (version 1.9) on the following machines:

Node 1 (Master) : 4 GB (3.8 GB) Core2 Duo 2.80GHz,  Ubuntu 16.04 LTS
Node 2 (Slave) : 16 GB, Core i7-3.40GHz, Ubuntu 16.04 LTS
Node 3 (Slave) : 16 GB, Core i7-3.40GHz, Ubuntu 16.04 LTS

I have followed the instructions on: https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/cluster_setup.html

I have set "jobmanager.rpc.address" in conf/flink-conf.yaml in the following format: master@master-node1-hostname

The slaves are listed in conf/slaves as:

slave@slave-node2-hostname
slave@slave-node3-hostname
master@master-node1-hostname (using the master machine for task execution too)


However, when I run bin/start-cluster.sh on the Master node, it fails to start the taskexecutor daemon on both Slave nodes. It only starts the taskexecutor daemon and the standalonesession daemon on master@master-node1-hostname (Node 1).

I have tried both passwordless SSH and password-based SSH on all machines, but the result is the same. In the latter case, it does ask for the passwords of slave@slave-node2-hostname and slave@slave-node3-hostname, but it does not display any message like "starting taskexecutor daemon on xxxx" after that.

I then switched the master to Node 2 and made Node 1 a slave. It was able to start taskexecutor daemons on both Node 2 and Node 3 successfully but did nothing for Node 1.
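
To narrow this down, a check along the following lines could be run from the master node. It is only a rough sketch: list_workers and check_workers are hypothetical helper names, not part of Flink. It assumes that each usable line of conf/slaves is an SSH target and that Flink is installed under the same path on every machine, as the cluster setup guide requires.

```shell
# Print the usable entries of a slaves file: strip "#" comments,
# trim surrounding whitespace, and drop lines that end up empty.
list_workers() {
    sed -e 's/#.*$//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" | grep -v '^$'
}

# For each entry, try a non-interactive SSH login and check that
# bin/taskmanager.sh exists and is executable at the same path remotely.
# $1 = path to conf/slaves, $2 = Flink install directory (assumed identical on all nodes)
check_workers() {
    list_workers "$1" | while read -r worker; do
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n keeps ssh from consuming the rest of the worker list on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$worker" \
               "test -x '$2/bin/taskmanager.sh'"; then
            echo "OK   $worker"
        else
            echo "FAIL $worker"
        fi
    done
}

# Example call (the path is an assumption, adjust to the real install directory):
# check_workers /opt/flink/conf/slaves /opt/flink
```

A "FAIL" line would mean that either the non-interactive login or the path check did not succeed for that entry; running the same ssh command by hand with -v should show which of the two it is.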

I'd appreciate it if you could advise on what the problem could be and how I can resolve it.

Best Regards,
Komal