Fwd: TaskManager unable to register with JobManager


Ravinder Kaur

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn it and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: 2 VMs (one master and one worker)


When I start the cluster, the JobManager and the TaskManager are started on the master and the worker respectively.

Command to start the cluster: bin/start-cluster.sh

jps shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

The result is attached to this mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suspected that the JobManager cannot find the TaskManager, and checked the TaskManager logs. They show that the TaskManager is indeed unable to register at the JobManager for quite a long time. The log contains these messages:

org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out
org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable

After a number of attempts the TaskManager does start up and tries to register at the JobManager, which also fails after many attempts with the following messages:

org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds)
org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, gives a good explanation that TaskManagers take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long, I get the same error.

Another suggestion was to run the cluster in streaming mode, so I started it with bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked with netstat -na and lsof -i).
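
A connection that stays in SYN_SENT means the worker sent its SYN but has not received a reply. The same reachability check can be sketched in Python; this is only an illustration, not part of Flink. The function name can_connect is hypothetical, and the host "master" and port 6123 are the values from the log lines above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (for example, SYN packets that never get a reply),
        # refused connections, and host names that cannot be resolved.
        return False


# Example call, to be run on the worker node:
#   can_connect("master", 6123)
```

If this returns False on the worker while the port is in LISTEN state on the master, the problem is likely between the two machines (name resolution, routing, or a firewall) rather than in the Flink configuration itself.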

I opened the web page on the master at http://localhost:8081, but it shows nothing, and localhost:8080 says connection refused.

Kindly help me out, as this is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur


Attachment: image.png (106K)

Re: TaskManager unable to register with JobManager

Stephan Ewen
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Re: TaskManager unable to register with JobManager

Ravinder Kaur
Hello,

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster with ./bin/start-cluster.sh on the master, it shows the normal output for starting the JobManager and TaskManager, but when I run jps on the nodes, the worker does not have a TaskManager running.

Running the WordCount example again fails with the same error. Stopping the cluster reports that there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:

Re: TaskManager unable to register with JobManager

Stephan Ewen
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:

Re: TaskManager unable to register with JobManager

Ravinder Kaur
Hello,

The log file of the TaskManager now shows the following:

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:

Re: TaskManager unable to register with JobManager

Stephan Ewen
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
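
A quick way to catch this kind of problem before starting the cluster is to read the configured address and check that it resolves on each node. The following is a minimal sketch under assumptions: it treats flink-conf.yaml as simple "key: value" lines, and the helper names read_jobmanager_address and resolves are hypothetical, not part of Flink.

```python
import socket
from typing import Optional

KEY = "jobmanager.rpc.address"


def read_jobmanager_address(conf_text: str) -> Optional[str]:
    """Return the value of jobmanager.rpc.address, or None if it is not set."""
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith(KEY + ":"):
            return line.split(":", 1)[1].strip()
    return None


def resolves(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an IPv4 address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


# Example, using the configuration directory from the log above:
#   with open("/home/flink/flink-0.10.1/conf/flink-conf.yaml") as f:
#       address = read_jobmanager_address(f.read())
#   print(address, resolves(address) if address else "not configured")
```

Running this on the worker would show whether the name written in the configuration is one the worker can actually resolve.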

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the Taskmanager now shows the following 

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManger logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I tried to set jobmanager.rpc.address in flink-conf.yaml to the hostname of master node on both the nodes.

Now it does not start the Taskmanager at the worker node at all. When I start the cluster using ./bin/start-cluster.sh on master it shows the normal output of starting the Jobmanager and Taskmanager but when I run jps on the nodes the slave does not have the Taskmanager running. 

Running the WordCount example again fails showing the same error. Stopping the cluster says no taskmanager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have setup a 2- node standalone Flink(0.10.1) cluster (one master and one worker). I'm facing the following issue. 

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

the result is attached with the mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailabeException: Not enough free slots available to run to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots= 0, available slots=0

Therefore I suppose that the JobManager does not find the TaskManager and checked the logs of the TaskManager which indeed shows that the TaskManager is unable to register at the JobManager for quite a long time. There are org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable messages in the log of the TaskManager. Later when it starts up after a number of attempts and tries to register at the JobManager, which also fails after a lot of attempts showing the following message org.apache.flink.runtime.taskmanager.Taskmanager: Trying to register at JobManager akka.tcp://flink@master:6123/user'/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.Taskmanager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long I get the same error.

Another suggestion was to run the cluster in streaming mode, so I started it with bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I have rechecked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked with netstat -na and lsof -i).

I opened the web page on the master at http://localhost:8081 but it shows nothing, and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur







Re: TaskManager unable to register with JobManager

Ravinder Kaur
Hello,

Thank you for pointing it out. I had made a small typo when editing the hostname in flink-conf.yaml. I've corrected it and the TaskManager now starts up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)


Kind Regards,
Ravinder Kaur
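
The registration attempts in the log above use timeouts of 500, 1000, 2000, 4000 and 8000 milliseconds, and the original post reports a 30 second timeout at attempt 92. A minimal sketch that reproduces this sequence, assuming the timeout simply doubles per attempt and is capped at 30 seconds (this is inferred from the log lines only, not taken from Flink's source; the class and method names are hypothetical):

```java
// Hypothetical helper: reproduces the timeout sequence visible in the log,
// ASSUMING doubling from 500 ms with a 30 s cap. Not Flink's actual code.
final class RegistrationBackoff {
    static final long INITIAL_TIMEOUT_MS = 500;   // timeout of attempt 1 in the log
    static final long MAX_TIMEOUT_MS = 30_000;    // assumed cap, from "timeout:30seconds"

    // Returns the assumed timeout for the given 1-based attempt number.
    static long timeoutForAttempt(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        long timeout = INITIAL_TIMEOUT_MS;
        for (int i = 1; i < attempt; i++) {
            timeout = Math.min(timeout * 2, MAX_TIMEOUT_MS);
        }
        return timeout;
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + timeoutForAttempt(attempt) + " ms");
        }
    }
}
```

Under this assumption the growing timeouts are ordinary retry behaviour; what matters for the diagnosis is that none of the attempts is ever answered.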

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
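
An UnknownHostException like the one above means the worker could not turn the configured JobManager host name into an IP address. A minimal sketch for checking this directly on the worker node, using only the JDK's InetAddress.getByName (the class name HostResolutionCheck is hypothetical, and the host name to test is whatever is set as jobmanager.rpc.address):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

// Hypothetical diagnostic helper, not part of Flink.
final class HostResolutionCheck {

    // Returns the resolved IP address, or empty if the name cannot be resolved.
    static Optional<String> resolve(String host) {
        try {
            return Optional.of(InetAddress.getByName(host).getHostAddress());
        } catch (UnknownHostException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Pass the value of jobmanager.rpc.address as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        Optional<String> address = resolve(host);
        if (address.isPresent()) {
            System.out.println(host + " resolves to " + address.get());
        } else {
            System.out.println(host + " cannot be resolved from this machine");
        }
    }
}
```

If the name does not resolve on the worker, a typo in flink-conf.yaml or a missing DNS or /etc/hosts entry are the usual suspects.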

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the Taskmanager now shows the following 

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster using ./bin/start-cluster.sh on the master, it shows the normal output of starting the JobManager and TaskManager, but when I run jps on the nodes, the slave does not have the TaskManager running.

Running the WordCount example again fails with the same error. Stopping the cluster says there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan










Re: TaskManager unable to register with JobManager

rmetzger0
Hi,

The TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?
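
One way to test whether the JobManager port is reachable from the worker is a plain TCP connect with a timeout. The following is a minimal sketch using only JDK classes; the class name PortCheck is hypothetical, and the defaults "master" and 6123 are taken from the log lines in this thread. A connection stuck in SYN_SENT, as reported earlier, would show up here as a timeout, which often (though not necessarily) points to a firewall or routing problem between the two VMs.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of Flink.
final class PortCheck {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers timeouts, refused connections and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are assumptions based on the thread; override via arguments.
        String host = args.length > 0 ? args[0] : "master";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 6123;
        boolean ok = canConnect(host, port, 5000);
        System.out.println(host + ":" + port + (ok ? " is reachable" : " is NOT reachable"));
    }
}
```

Run it on the worker node; if it reports the port as not reachable while the JobManager is listening on the master, the problem lies in the network between the machines rather than in Flink itself.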


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a little typo while I edited the hostname in flink-conf.yaml. I've reset it and the TaskManager started up. But I still can't run the WordCount example and it throws the same NoResourceAvaliableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableExce                                                                                        ption: Not enough free slots available to run the job. You can decrease the oper                                                                                        ator parallelism or increase the number of slots per TaskManager in the configur                                                                                        ation. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDa                                                                                        taSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat                                                                                        )) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(Wo                                                                                        rdCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f6                                                                                        8c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8c                                                                                        e1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d                                                                                        59] >. Resources available to scheduler: Number of instances=0, total number of                                                                                         slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(                                                                                        Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmed                                                                                        iately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecutio                                                                                        n(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForEx                                                                                        ecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAl                                                                                        l(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExe                                                                                        cution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl                                                                                        ink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982                                                                                        )
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl                                                                                        ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl                                                                                        ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the Taskmanager now shows the following 

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster with ./bin/start-cluster.sh on the master, it shows the normal output for starting the JobManager and TaskManager, but when I run jps on the nodes, the slave has no TaskManager running.

Running the WordCount example again fails with the same error. Stopping the cluster reports that there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan
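
One way to confirm that the value written to flink-conf.yaml actually resolves on each node is sketched below. It is only an illustration: the helper names read_flink_conf and resolves are made up for this example and are not part of Flink, and the parser assumes flat "key: value" lines only.

```python
import socket


def read_flink_conf(text):
    """Parse flat 'key: value' lines (hypothetical helper, not a Flink API)."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" in line:
            key, value = line.split(":", 1)
            conf[key.strip()] = value.strip()
    return conf


def resolves(hostname):
    """Return True if the local resolver can map hostname to an address."""
    if not hostname:
        return False
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False
```

Reading conf/flink-conf.yaml on the worker, passing its text to read_flink_conf, and then calling resolves on the "jobmanager.rpc.address" entry should return True there; if it returns False, the worker's resolver does not know the configured name.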


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

the result is attached with the mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suppose that the JobManager does not find the TaskManager. I checked the TaskManager logs, which indeed show that the TaskManager is unable to register at the JobManager for quite a long time. The log contains the messages org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable. Later, after a number of attempts, the TaskManager starts up and tries to register at the JobManager, which also fails after many attempts with the messages org.apache.flink.runtime.taskmanager.Taskmanager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.Taskmanager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that the TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long, I get the same error.

Another suggestion was to run the cluster in streaming mode, so I started it with bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master shows SYN_SENT state (checked with netstat -na and lsof -i).

I opened the web page on the master, http://localhost:8081, but it shows nothing, and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur









Re: TaskManager unable to register with JobManager

Stephan Ewen
There still seems to be something wrong with your network config.

This does not look like a Flink problem and needs work on your end; we cannot debug that for you.

Please go through your network setup and check, for example:

  - whether the hostnames are right (is "master-IP" really the name of the network interface on the JobManager?)
  - whether the machines can actually communicate with each other (firewall, etc.)
  - whether the "master-IP" interface is externally visible (such that other machines can connect to it)

These things are prerequisites for any distributed system installation.
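
The checks listed above can be approximated from the worker with a small probe. This is a minimal sketch, not a Flink tool: the function name probe and its return labels are invented for illustration, and port 6123 is simply the JobManager port that appears in the logs of this thread.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP connection attempt to host:port.

    Returns "unresolvable", "open", "refused", "timeout", or "error".
    """
    try:
        socket.getaddrinfo(host, port)  # name resolution only
    except socket.gaierror:
        return "unresolvable"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host answered, nothing listening on that address
    except socket.timeout:
        return "timeout"   # no answer at all, e.g. packets dropped
    except OSError:
        return "error"     # e.g. network unreachable
```

Run on the worker as probe("master-IP", 6123), with the real address substituted. A "timeout" result would typically match a connection stuck in SYN_SENT, which often points at a firewall or routing rule dropping packets rather than at Flink itself, while "refused" would suggest the JobManager is not listening on that particular interface.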


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

The TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had made a small typo when I edited the hostname in flink-conf.yaml. I've fixed it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
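
The registration attempts in the log above use timeouts of 500, 1000, 2000, 4000 and 8000 milliseconds, and the original message in this thread quotes attempt 92 with a timeout of 30 seconds. One reading of those numbers, offered purely as an assumption drawn from the logs and not as Flink's actual implementation, is a timeout that doubles on every attempt up to a cap. The function name and its parameters are hypothetical.

```python
def registration_timeouts(attempts, initial_ms=500, cap_ms=30000):
    """Yield per-attempt timeouts that double each time, up to cap_ms.

    initial_ms and cap_ms are assumptions inferred from the log lines
    in this thread, not values taken from Flink's source.
    """
    timeout = initial_ms
    for _ in range(attempts):
        yield timeout
        timeout = min(timeout * 2, cap_ms)
```

Under these assumed parameters, list(registration_timeouts(5)) reproduces the five timeouts logged above, and the timeouts for 92 attempts sum to 2,611,500 ms, roughly 43 minutes.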


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the Taskmanager now shows the following 

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur











Re: TaskManager unable to register with JobManager

Ravinder Kaur
Hello,

I also feel that it has something to do with the network configuration, but I have checked all the prerequisites that you mentioned:

  1. The hostnames are right. ("master-IP" or "hostname-of-master" is just my edit to make things clear; it's not the real value.)
  2. The machines can communicate. Passphraseless SSH works fine. Ping and telnet also work fine. Nothing suspicious.
  3. Other machines are able to connect to the master machine.

Is there anything else that you can suggest?

Kind Regards,
Ravinder

On Wed, Feb 3, 2016 at 9:36 PM, Stephan Ewen <[hidden email]> wrote:
There still seems to be something wrong with your network config.

This does not look like a Flink problem and needs work on your end; we cannot debug that for you.

Please go through your network setup and check, for example:

  - whether the hostnames are right (is "master-IP" really the name of the network interface on the JobManager?)
  - whether the machines can actually communicate with each other (firewall, etc.)
  - whether the "master-IP" interface is externally visible (such that other machines can connect to it)

These things are prerequisites for any distributed system installation.


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but its not able to register at the job manager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a little typo while I edited the hostname in flink-conf.yaml. I've reset it and the TaskManager started up. But I still can't run the WordCount example and it throws the same NoResourceAvaliableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableExce                                                                                        ption: Not enough free slots available to run the job. You can decrease the oper                                                                                        ator parallelism or increase the number of slots per TaskManager in the configur                                                                                        ation. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDa                                                                                        taSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat                                                                                        )) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(Wo                                                                                        rdCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f6                                                                                        8c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8c                                                                                        e1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d                                                                                        59] >. Resources available to scheduler: Number of instances=0, total number of                                                                                         slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(                                                                                        Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmed                                                                                        iately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecutio                                                                                        n(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForEx                                                                                        ecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAl                                                                                        l(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExe                                                                                        cution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl                                                                                        ink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982                                                                                        )
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl                                                                                        ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl                                                                                        ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
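
The registration timeouts in the log above double with each attempt (500, 1000, 2000, 4000, 8000 milliseconds). A minimal sketch of such a schedule follows. It assumes a cap of 30 seconds, as suggested by the "attempt 92, timeout 30 seconds" message quoted in the original post. The function name and its parameters are hypothetical and are not part of Flink's API.

```python
def registration_timeouts(attempts, initial_ms=500, max_ms=30_000):
    """Return the timeout in milliseconds for each registration attempt.

    Hypothetical helper: doubles the timeout after every attempt and
    caps it at max_ms (the 30 s cap is an assumption, see above).
    """
    timeouts = []
    current = initial_ms
    for _ in range(attempts):
        timeouts.append(current)
        current = min(current * 2, max_ms)
    return timeouts


print(registration_timeouts(5))  # [500, 1000, 2000, 4000, 8000]
```

With such a schedule the attempt counter keeps growing for as long as the JobManager stays unreachable, so a high attempt number only says how long the TaskManager has been retrying, not why it fails.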


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
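
The exception says the worker could not turn the configured JobManager name into an IP address. A minimal sketch of the same lookup outside Flink, using only the Python standard library; `resolve_host` is a hypothetical helper name, and 'hostname-of-master' is the placeholder from the log, not a real host.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses a name resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name cannot be resolved (typo, missing DNS or /etc/hosts entry).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_host("hostname-of-master")` on the worker with the exact value from flink-conf.yaml would show whether the name resolves there; an empty list points to a typo in the configuration or a missing name entry on that machine.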

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the TaskManager now shows the following:

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster with ./bin/start-cluster.sh on the master, it prints the normal output for starting the JobManager and the TaskManager, but when I run jps on the nodes, the slave has no TaskManager running.

Running the WordCount example again fails with the same error, and stopping the cluster reports that there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.
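
A minimal sketch for checking which address a node's flink-conf.yaml actually contains, assuming the file uses flat `key: value` lines. `read_flink_conf` is a hypothetical helper, and the sample values below are illustrative only.

```python
def read_flink_conf(text):
    """Parse flat 'key: value' lines into a dict, ignoring comments and blanks.

    Simplification: everything after a '#' is treated as a comment.
    """
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


# Illustrative content only; on a real node, read conf/flink-conf.yaml instead.
sample = """
# hypothetical example values
jobmanager.rpc.address: master.abc.xyz.com
jobmanager.rpc.port: 6123
"""
conf = read_flink_conf(sample)
print(conf["jobmanager.rpc.address"])  # master.abc.xyz.com
```

Running this against the file on the master and on the worker and comparing the two results is a quick way to spot a typo or a mismatch between the nodes.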

Greetings,
Stephan


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

The result is attached to this mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suspected that the JobManager does not find the TaskManager, so I checked the TaskManager logs. They show that the TaskManager is unable to register at the JobManager for quite a long time. The log contains the messages org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable. After a number of attempts the TaskManager starts up and tries to register at the JobManager, which also fails after many attempts with the following messages: org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt: 92, timeout: 30 seconds) and org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that the TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long I get the same error.

Another suggestion was to run the cluster in streaming mode. I tried that with the command bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked with netstat -na and lsof -i).
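
A port in LISTEN state on the master combined with a connection stuck in SYN_SENT on the worker often means that the packets never reach the master, for example because a firewall between the VMs drops them. A minimal sketch of a reachability check that can be run from the worker, using only the Python standard library; `can_connect` is a hypothetical helper name.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections and unresolvable host names.
        return False
```

Running `can_connect("master", 6123)` on the worker (with the address the TaskManager uses for the JobManager) and getting False while the port is listening on the master would suggest looking at firewall rules or routing between the two machines rather than at the Flink configuration.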

I opened the web page on the master at http://localhost:8081, but it shows nothing, and localhost:8080 says connection refused.

Kindly help me out, as this is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur











Re: TaskManager unable to register with JobManager

Ravinder Kaur
In reply to this post by rmetzger0
Hello,

Here is the log file of the JobManager. I did not see anything suspicious, and as it shows, the ports are also listening.

20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY
20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Security is not enabled. Starting non-authenticated JobManager.
20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager
20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 10.155.208.138:6123
20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:48,295 INFO  Remoting                                                      - Starting remoting
20:58:48,541 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]
20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend
20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:8081
20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None.
20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081
20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a small typo in the hostname I edited in flink-conf.yaml. I've fixed it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the TaskManager now shows the following:

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
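The UnknownHostException above means the worker could not turn the configured JobManager hostname into an IP address. A minimal sketch for checking this independently of Flink, to be run on the worker, follows; the function name `resolve` is made up for illustration, and it only asks the operating system's resolver (DNS and /etc/hosts), which may differ in detail from what the JVM does.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address the local resolver reports for hostname, or None."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (name not known) is a subclass of OSError
        return None


# Hypothetical usage on the worker node; replace "master" with the value of
# jobmanager.rpc.address from flink-conf.yaml:
#   print(resolve("master") or "cannot be resolved from this machine")
```

If this returns None for the configured name, the name has to be corrected in flink-conf.yaml or made resolvable on the worker (for example through DNS or an /etc/hosts entry).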

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster with ./bin/start-cluster.sh on the master, it shows the normal output for starting the JobManager and TaskManager, but when I run jps on the nodes, the worker has no TaskManager running.

Running the WordCount example again fails with the same error, and stopping the cluster reports that there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan
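As an illustration of the suggestion above, here is a small sketch that reads the JobManager endpoint out of the text of a flink-conf.yaml, so the value can be compared on master and worker. The helper names are hypothetical and the parser only handles plain "key: value" lines; jobmanager.rpc.address and jobmanager.rpc.port are the Flink keys, and 6123 is assumed as the default port, matching the logs in this thread.

```python
def read_flink_conf(text):
    """Parse simple 'key: value' lines; blank lines and '#' comments are skipped."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments (sketch: no '#' in values)
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def jobmanager_endpoint(conf):
    """Return (host, port) of the JobManager RPC endpoint from a parsed config."""
    host = conf.get("jobmanager.rpc.address")
    port = int(conf.get("jobmanager.rpc.port", 6123))  # 6123 assumed as default
    return host, port


# Example with the placeholder host name used above:
sample = "jobmanager.rpc.address: master.abc.xyz.com\njobmanager.rpc.port: 6123\n"
print(jobmanager_endpoint(read_flink_conf(sample)))
```

Running the same check against the config file on every node makes a typo in the hostname on one machine easy to spot.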


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

the result is attached with the mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suppose that the JobManager does not find the TaskManager. The TaskManager log indeed shows that it is unable to register at the JobManager for quite a long time. The log contains these messages:

org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out
org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable

Later, once the TaskManager has started up after a number of attempts, it tries to register at the JobManager. This also fails after many attempts, with the following messages:

org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds)
org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long, I get the same error.

Another suggestion was to run the cluster in streaming mode, so I started it with bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked with netstat -na and lsof -i).

I opened the web page http://localhost:8081 on the master, but it shows nothing, and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur










Re: TaskManager unable to register with JobManager

Stephan Ewen
Can the machines connect to port 6123? The firewall may block that port but permit SSH.
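One way to answer this question from the worker is a plain TCP connection attempt. The sketch below is an illustration, not part of Flink; the function name `probe` and the 3-second timeout are arbitrary choices. A "timeout" result is what one would typically expect when a firewall silently drops the packets, which would also match a connection stuck in SYN_SENT, whereas "refused" usually means the host is reachable but nothing is listening on that port.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host answered, but no listener on this port
    except (socket.timeout, TimeoutError):
        return "timeout"   # no answer at all, e.g. packets dropped on the way
    except OSError as exc:
        return "error: %s" % exc  # e.g. name resolution failure, no route


# Hypothetical usage on the worker node ("master" is a placeholder host name):
#   print(probe("master", 6123))
#   print(probe("master", 22))   # SSH, for comparison
```

Comparing the result for port 6123 with the result for port 22 shows whether only the Flink port is affected.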

On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Here is the log file of the JobManager. I did not see anything suspicious, and as the log suggests, the ports are listening.

20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY
20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Security is not enabled. Starting non-authenticated JobManager.
20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager
20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 10.155.208.138:6123
20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:48,295 INFO  Remoting                                                      - Starting remoting
20:58:48,541 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]
20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend
20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:8081
20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None.
20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081
20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
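The decisive part of the exception above is the resource summary at its end: with "Number of instances=0" no TaskManager has registered at all, so changing parallelism or slot counts cannot help. A minimal sketch for pulling that summary out of a JobManager log follows; the function names are invented for this example and the pattern is written against the message text shown in this thread only.

```python
import re

# Matches the summary at the end of the NoResourceAvailableException message.
SUMMARY = re.compile(
    r"Number of instances=\s*(\d+), total number of slots=\s*(\d+), "
    r"available slots=\s*(\d+)"
)


def scheduler_resources(log_text):
    """Return the first resource summary found in log_text as a dict, or None."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    instances, total, available = (int(g) for g in match.groups())
    return {"instances": instances, "total_slots": total, "available_slots": available}


def diagnose(resources):
    """Give a rough reading of the summary (a heuristic, not Flink's own logic)."""
    if resources["instances"] == 0:
        return "no TaskManager has registered at the JobManager"
    if resources["available_slots"] == 0:
        return "TaskManagers are registered, but all slots are in use"
    return "free slots exist"


line = ("Resources available to scheduler: Number of instances=0, "
        "total number of slots=0, available slots=0")
print(diagnose(scheduler_resources(line)))
```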


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a small typo in the hostname when I edited flink-conf.yaml. I've corrected it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
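The registration attempts above use timeouts of 500, 1000, 2000, 4000 and 8000 milliseconds, that is, the timeout doubles with every attempt. The sketch below reproduces that schedule; it is inferred from the log lines only, not taken from the Flink source, and the 30-second cap is an assumption based on the "attempt:92, timeout:30seconds" message quoted in the first mail of this thread.

```python
def registration_timeout_ms(attempt, initial_ms=500, cap_ms=30000):
    """Timeout for the given 1-based attempt: doubles each time, up to cap_ms.

    initial_ms matches the first log line; cap_ms is an assumed upper bound.
    """
    return min(initial_ms * 2 ** (attempt - 1), cap_ms)


for attempt in range(1, 8):
    print(attempt, registration_timeout_ms(attempt))
```

Under these assumptions the timeout reaches the cap at attempt 7, so a TaskManager that has been retrying for many minutes is by then attempting to register roughly every 30 seconds, and waiting longer does not help while the port stays unreachable.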


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the Taskmanager now shows the following 

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I tried setting jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster using ./bin/start-cluster.sh on the master, it shows the normal output of starting the JobManager and TaskManager, but when I run jps on the nodes, the slave does not have the TaskManager running.

Running the WordCount example again fails with the same error. Stopping the cluster reports that there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

the result is attached with the mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suppose that the JobManager does not find the TaskManager. I checked the TaskManager logs, which indeed show that the TaskManager is unable to register at the JobManager for quite a long time. The TaskManager log contains the messages org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable. Later, when it starts up after a number of attempts, it tries to register at the JobManager, which also fails after many attempts with the messages org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long I get the same error.

Another suggestion was to run the cluster in streaming mode, so I started it with bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked using netstat -na and lsof -i).

I opened the webpage on master http://localhost:8081 but it shows nothing and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur

Re: TaskManager unable to register with JobManager

Ravinder Kaur
Hello,

Thank you very much. This was indeed the problem. The firewall was blocking ports 6123 and 43008, and the user did not have permission to change the firewall rules.

I retried the following command with root privileges: ufw allow port

and this made the job run.

Kind Regards,
Ravinder Kaur


On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
Can the machines connect to port 6123? The firewall may block that port but permit SSH.
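
One way to answer this from the worker is a plain TCP connect test. The following is only a minimal Python sketch and not part of Flink; the helper name can_connect is made up for illustration, and the host you pass in (for example the master's hostname) is whatever jobmanager.rpc.address points to in your setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (for example a firewall
        # silently dropping the SYN packets) and name-resolution failures.
        return False
```

Calling can_connect("master", 6123) on the worker (with "master" replaced by the real JobManager host) should return True once the port is reachable. A False result together with a SYN_SENT entry in netstat would be consistent with a firewall dropping the packets, though it does not prove it.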

On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Here is the log file of the JobManager. I did not see anything suspicious, and as it suggests, the ports are also listening.

20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY
20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Security is not enabled. Starting non-authenticated JobManager.
20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager
20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 10.155.208.138:6123
20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:48,295 INFO  Remoting                                                      - Starting remoting
20:58:48,541 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]
20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend
20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:8081
20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None.
20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081
20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a small typo from when I edited the hostname in flink-conf.yaml. I've corrected it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
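
The registration timeouts in the lines above double with every attempt (500, 1000, 2000, 4000, 8000 ms). As a rough illustration, here is a small Python sketch of such a schedule. It is read off these log lines and not taken from Flink's source; the 30-second cap is an assumption based on the "attempt:92, timeout:30seconds" message from my first mail, and the function name is made up.

```python
def registration_timeout_ms(attempt, initial_ms=500, cap_ms=30_000):
    """Timeout for the given 1-based attempt: doubles each attempt, capped at cap_ms."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    # Bound the exponent so very large attempt numbers do not build huge integers.
    exponent = min(attempt - 1, 32)
    return min(initial_ms * 2 ** exponent, cap_ms)


# Attempts 1..5 correspond to the five log lines in the excerpt.
first_five = [registration_timeout_ms(n) for n in range(1, 6)]
```

Under these assumptions the timeout reaches the cap at the seventh attempt and stays there, which would explain why the later attempts in the log all show the same timeout.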


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
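
A quick way to check on the worker whether the configured name resolves at all is sketched below. This is a minimal Python example and not a Flink API; resolves() is a hypothetical helper, and the name to pass in is whatever is set as jobmanager.rpc.address ('hostname-of-master' is only the anonymized placeholder from the log).

```python
import socket


def resolves(hostname):
    """Return the sorted IP addresses hostname resolves to, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

An empty list on the worker for the master's name would match the UnknownHostException above and points at /etc/hosts, DNS, or a typo in the configured name.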

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The TaskManager log file now shows the following:

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManger logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I tried to set jobmanager.rpc.address in flink-conf.yaml to the hostname of master node on both the nodes.

Now it does not start the Taskmanager at the worker node at all. When I start the cluster using ./bin/start-cluster.sh on master it shows the normal output of starting the Jobmanager and Taskmanager but when I run jps on the nodes the slave does not have the Taskmanager running. 

Running the WordCount example again fails showing the same error. Stopping the cluster says no taskmanager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have setup a 2- node standalone Flink(0.10.1) cluster (one master and one worker). I'm facing the following issue. 

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

the result is attached with the mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailabeException: Not enough free slots available to run to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots= 0, available slots=0

Therefore I suppose that the JobManager does not find the TaskManager and checked the logs of the TaskManager which indeed shows that the TaskManager is unable to register at the JobManager for quite a long time. There are org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable messages in the log of the TaskManager. Later when it starts up after a number of attempts and tries to register at the JobManager, which also fails after a lot of attempts showing the following message org.apache.flink.runtime.taskmanager.Taskmanager: Trying to register at JobManager akka.tcp://flink@master:6123/user'/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.Taskmanager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen the guy who provided the solution in both the links gives a good explanation that the TaskManagers take quite some time to register at the JobManager and therefore I waited for as long as 20 mins after starting the cluster to run the job. But even after waiting so long I get the same error. 

Another suggestion was to run the cluster in streaming mode. I tried that with the command bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I re-checked all the configurations but could not find anything wrong. I also checked port 6123 with netstat -na and lsof -i: on the master it is in LISTEN state, and the TCP request from the worker to the master stays in SYN_SENT state.

I opened the web page http://localhost:8081 on the master, but it shows nothing, and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur

Re: TaskManager unable to register with JobManager

Ravinder Kaur
In reply to this post by Stephan Ewen
Hello All,

I need to know the range of ports that are being used during the master/slave communication in the Flink cluster. Also is there a way I can specify a range of ports, at the slaves, to restrict them to connect to master only in this range?
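
As a partial answer drawn only from the logs quoted in this thread, the sketch below pulls the port numbers out of a few of those log lines. The patterns are hypothetical, written for exactly these message shapes, and are not a general Flink log parser. In these logs only 6123 and 8081 look like fixed settings; the BLOB server, TaskManager actor system and data ports (43683, 33813, 49030) appear to be chosen at startup, since the TaskManager log shows its actor system starting at port 0.

```python
import re

# Log excerpts copied from the JobManager and TaskManager logs in this thread.
LOG_LINES = [
    "Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]",
    "Web frontend listening at 0:0:0:0:0:0:0:0:8081",
    "Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000",
    "Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]",
    "TaskManager data connection information: hostname-of-master (dataPort=49030)",
]

# Hypothetical patterns: one label and one regex per message shape above.
PATTERNS = [
    ("actor system (RPC)", re.compile(r"akka\.tcp://flink@[^:\]]+:(\d+)\]")),
    ("web frontend", re.compile(r"Web frontend listening at .*:(\d+)$")),
    ("BLOB server", re.compile(r"Started BLOB server at [^ ]+:(\d+) ")),
    ("TaskManager data", re.compile(r"dataPort=(\d+)")),
]


def ports_in_logs(lines):
    """Return (label, port) pairs for every pattern that matches a log line."""
    found = []
    for line in lines:
        for label, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                found.append((label, int(match.group(1))))
    return found


for label, port in ports_in_logs(LOG_LINES):
    print(f"{label}: {port}")
```

Any port that is picked at startup would have to be pinned in the configuration before a firewall rule for a fixed range could work; which keys do that is something to confirm in the Flink configuration documentation for the version in use.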

Kind Regards,
Ravinder Kaur


On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
Can machines connect to port 6123? The firewall may block that port, but permit SSH.
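
One way to answer that question from the worker is a plain TCP connect test. This is a minimal sketch using only the Python standard library; the function name `can_connect` is made up for illustration, and "master" in the usage comment is a placeholder for whatever jobmanager.rpc.address is set to.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


# Example, run on the worker (placeholder host name, port taken from the logs):
#   can_connect("master", 6123)
#   can_connect("master", 22)   # SSH, for comparison
```

If the SSH port answers but 6123 does not, while the JobManager log shows it listening on 6123, a firewall between the two VMs is a likely suspect. A connection stuck in SYN_SENT, as reported in the first mail, would fit that picture.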

On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Here is the log file of the JobManager. I did not see anything suspicious, and as the log suggests, the ports are listening.

20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY
20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Security is not enabled. Starting non-authenticated JobManager.
20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager
20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 10.155.208.138:6123
20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:48,295 INFO  Remoting                                                      - Starting remoting
20:58:48,541 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]
20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend
20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:8081
20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None.
20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081
20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a small typo from when I edited the hostname in flink-conf.yaml. I've fixed it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The TaskManager log again shows the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
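
The registration attempts above follow a doubling pattern: 500, 1000, 2000, 4000, 8000 milliseconds. The first mail in this thread quotes attempt 92 with a timeout of 30 seconds, which suggests the timeout stops growing at some cap. The sketch below reproduces that schedule; the cap of 30000 ms and the intermediate value of 16000 ms are inferred from those two observations, not taken from Flink's source, and `registration_timeouts` is a made-up name.

```python
def registration_timeouts(initial_ms=500, max_ms=30_000, attempts=8):
    """Yield one timeout per attempt: start at initial_ms, double each time, cap at max_ms."""
    timeout = initial_ms
    for _ in range(attempts):
        yield timeout
        timeout = min(timeout * 2, max_ms)


print(list(registration_timeouts()))
# [500, 1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

Under this assumption the TaskManager keeps retrying at the capped interval, so a long wait alone cannot help if the JobManager port is never reachable.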


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
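
A quick way to check this on the worker is to ask the operating system to resolve the same name that is configured as jobmanager.rpc.address. This is a minimal sketch with the Python standard library; `resolve` is a made-up helper name, and 'hostname-of-master' in the usage comment is the placeholder used in the log above.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses `hostname` resolves to, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry ends with a socket address tuple whose first element is the IP.
    return sorted({info[4][0] for info in infos})


# Example, run on the worker with the value from flink-conf.yaml:
#   resolve("hostname-of-master")
```

An empty result means the worker cannot resolve the configured name at all, which usually points to a typo in the configuration or a missing entry in DNS or /etc/hosts.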

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the TaskManager now shows the following:

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster with ./bin/start-cluster.sh on the master, it prints the normal output for starting the JobManager and TaskManager, but when I run jps on the nodes, the TaskManager is not running on the slave.

Running the WordCount example again fails with the same error, and stopping the cluster reports that there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan



Re: TaskManager unable to register with JobManager

Fabian Hueske-2
Hi Ravinder,

please have a look at the configuration documentation:

--> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager

Best, Fabian

2016-02-10 13:55 GMT+01:00 Ravinder Kaur <[hidden email]>:
Hello All,

I need to know the range of ports that are being used during the master/slave communication in the Flink cluster. Also is there a way I can specify a range of ports, at the slaves, to restrict them to connect to master only in this range?

Kind Regards,
Ravinder Kaur


On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
Can machines connect to port 6123? The firewall may block that port, but permit SSH.

On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Here is the log file of the JobManager. I did not see anything suspicious, and as the log suggests, the ports are listening.

20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY
20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Security is not enabled. Starting non-authenticated JobManager.
20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager
20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 10.155.208.138:6123
20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:48,295 INFO  Remoting                                                      - Starting remoting
20:58:48,541 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]
20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend
20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:8081
20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None.
20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081
20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a small typo in the hostname I edited in flink-conf.yaml. I've corrected it and the TaskManager now starts up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
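The registration attempts above double their timeout each time (500, 1000, 2000, 4000, 8000 milliseconds). A small sketch of such a schedule follows. The doubling is taken from the log; the 30-second cap is an assumption based on the "(attempt:92, timeout:30seconds)" message quoted in the first mail of this thread, and the function name is made up for illustration.

```python
def registration_timeouts(attempts, initial_ms=500, cap_ms=30_000):
    """Yield one timeout per attempt: doubling from initial_ms, capped at cap_ms.

    The cap of 30 s is an assumption inferred from a later log message,
    not something confirmed by the Flink sources.
    """
    timeout = initial_ms
    for _ in range(attempts):
        yield timeout
        timeout = min(timeout * 2, cap_ms)

print(list(registration_timeouts(5)))  # [500, 1000, 2000, 4000, 8000]
```

Under these assumptions the timeout reaches the cap after a handful of attempts, so a high attempt count such as 92 suggests the TaskManager has been retrying for a long time without ever reaching the JobManager.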


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
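To check on the worker whether the configured name resolves at all, a lookup through the operating system's resolver can help. This is a rough sketch in Python rather than the lookup Flink itself performs; `resolve` is a hypothetical helper and `hostname-of-master` stands for whatever is set in the configuration.

```python
import socket

def resolve(hostname):
    """Return the IPv4 address the local resolver reports for hostname, or None."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (name not known) is a subclass of OSError.
        return None

# Example call, to be run on the worker:
#   resolve("hostname-of-master")
```

If this returns None, the name is unknown to the worker; on Linux, an entry in /etc/hosts or a corrected name in flink-conf.yaml are the usual ways to fix that.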

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the TaskManager now shows the following:

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I start the cluster with ./bin/start-cluster.sh on the master, it shows the normal output for starting the JobManager and TaskManager, but when I run jps on the nodes, the slave has no TaskManager running.

Running the WordCount example fails again with the same error. Stopping the cluster reports that there is no TaskManager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan
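Since the same value has to be present on both nodes, it can help to read the setting back from each node's conf/flink-conf.yaml and compare. The sketch below assumes the file consists of flat `key: value` lines; the helper name and the sample text are hypothetical, with the host name taken from the example above.

```python
def read_flink_conf(text):
    """Parse flat 'key: value' lines, ignoring blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf

# Hypothetical excerpt of conf/flink-conf.yaml, not the poster's actual file.
sample = """
jobmanager.rpc.address: master.abc.xyz.com
jobmanager.rpc.port: 6123
"""
conf = read_flink_conf(sample)
print(conf["jobmanager.rpc.address"])  # master.abc.xyz.com
```

Running the same check on master and worker makes a typo in one of the two files easy to spot.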


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn it and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: consists of 2 VMs (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

The result is attached to this mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suppose that the JobManager does not find the TaskManager. I checked the TaskManager logs, which indeed show that the TaskManager is unable to register at the JobManager for quite a long time. The log contains the messages org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable. Later, when the TaskManager starts up after a number of attempts, it tries to register at the JobManager, which also fails after many attempts with the messages org.apache.flink.runtime.taskmanager.Taskmanager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.Taskmanager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that the TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long, I get the same error.

Another suggestion was to run the cluster in streaming mode. So I started it with bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I have rechecked all the configurations but cannot find the fault. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked with netstat -na and lsof -i).

I opened the webpage on master http://localhost:8081 but it shows nothing and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur













Re: TaskManager unable to register with JobManager

Stephan Ewen
Note that some of these config options are only available starting from version 1.0-SNAPSHOT

On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <[hidden email]> wrote:
Hi Ravinder,

please have a look at the configuration documentation:

--> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager

Best, Fabian

2016-02-10 13:55 GMT+01:00 Ravinder Kaur <[hidden email]>:
Hello All,

I need to know the range of ports that are being used during the master/slave communication in the Flink cluster. Also is there a way I can specify a range of ports, at the slaves, to restrict them to connect to master only in this range?

Kind Regards,
Ravinder Kaur
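As a starting point, the logs posted earlier in this thread already show which ports were in use in that particular run. The sketch below only extracts the port numbers from the message parts of those log lines (timestamps omitted); the regular expression and the helper name are illustrative choices, not part of Flink.

```python
import re

# Message parts of log lines quoted in this thread.
LOG_LINES = [
    "Starting JobManager actor system at 10.155.208.138:6123",
    "Web frontend listening at 0:0:0:0:0:0:0:0:8081",
    "Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000",
    "Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]",
    "TaskManager data connection information: hostname-of-master (dataPort=49030)",
]

# A port is taken to be digits directly after ':' or after 'dataPort='.
PORT = re.compile(r"(?::|dataPort=)(\d+)\b")

def last_port(line):
    """Return the last port-like number in line, or None if there is none."""
    found = PORT.findall(line)
    return int(found[-1]) if found else None

ports = [last_port(line) for line in LOG_LINES]
print(ports)  # [6123, 8081, 43683, 33813, 49030]
```

In that run, 6123 (JobManager) and 8081 (web frontend) are the well-known ports, while the BLOB server port, the TaskManager actor system port (started at slave_IP:0) and the data port look like they were chosen at startup, so they may differ on the next start unless they are pinned in the configuration.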


On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
Can machines connect to port 6123? The firewall may block that port, but permit SSH.

On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Here is the log file of the JobManager. I did not see anything suspicious, and as the log suggests, the ports are listening.

20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY
20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Security is not enabled. Starting non-authenticated JobManager.
20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager
20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 10.155.208.138:6123
20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:48,295 INFO  Remoting                                                      - Starting remoting
20:58:48,541 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]
20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend
20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:8081
20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None.
20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081
20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a small typo in the hostname I edited in flink-conf.yaml. I've fixed it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)


Kind Regards,
Ravinder Kaur
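
The registration attempts in the TaskManager log above double their timeout on each try (500 ms, 1000 ms, 2000 ms, 4000 ms, 8000 ms), and the first message of this thread reports a 30-second timeout by attempt 92. A minimal Python sketch of that pattern, assuming plain doubling with a 30-second cap; the function name and the exact capping rule are illustrative guesses, not taken from Flink's source:

```python
def registration_timeouts(attempts, initial_ms=500, cap_ms=30_000):
    """Yield one timeout per attempt, doubling each time until the cap is hit."""
    timeout = initial_ms
    for _ in range(attempts):
        yield timeout
        # Assumption: the timeout stops growing once it reaches cap_ms.
        timeout = min(timeout * 2, cap_ms)


if __name__ == "__main__":
    for attempt, timeout in enumerate(registration_timeouts(8), start=1):
        print(f"attempt {attempt}: timeout {timeout} ms")
```

Under these assumptions the cap is reached within a handful of attempts, so a log that is already at attempt 92 has been retrying for a long time, and waiting longer is unlikely to help.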

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
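
An UnknownHostException like this one usually means the name in the configuration cannot be resolved on the node that is starting up. A small Python sketch for checking name resolution on a node; `resolve_host` is a hypothetical helper, and the `resolver` parameter exists only so the function can be exercised without touching the network:

```python
import socket


def resolve_host(hostname, resolver=socket.getaddrinfo):
    """Return the sorted addresses for hostname, or [] if it does not resolve."""
    try:
        infos = resolver(hostname, None)
    except socket.gaierror:
        # Raised when the name is unknown to /etc/hosts and DNS alike.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_host("hostname-of-master")` on the worker (with the real hostname in place of the placeholder used in this thread) should return at least one address; an empty list would point at a typo in the name or a missing /etc/hosts or DNS entry.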

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the Taskmanager now shows the following 

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I tried to set jobmanager.rpc.address in flink-conf.yaml to the hostname of master node on both the nodes.

Now it does not start the Taskmanager at the worker node at all. When I start the cluster using ./bin/start-cluster.sh on master it shows the normal output of starting the Jobmanager and Taskmanager but when I run jps on the nodes the slave does not have the Taskmanager running. 

Running the WordCount example again fails showing the same error. Stopping the cluster says no taskmanager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan
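
Since jobmanager.rpc.address has to name the same master in the flink-conf.yaml on every node, comparing the two files is a quick sanity check. The sketch below is a rough Python illustration that assumes the file consists of flat `key: value` lines; `read_flink_conf` and `rpc_address_matches` are hypothetical helpers, and the sample hostnames are made up from Stephan's example:

```python
def read_flink_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def rpc_address_matches(master_conf, worker_conf):
    """True when both configs name the same, non-empty JobManager address."""
    key = "jobmanager.rpc.address"
    master = read_flink_conf(master_conf).get(key)
    worker = read_flink_conf(worker_conf).get(key)
    return bool(master) and master == worker


if __name__ == "__main__":
    master = "# master node\njobmanager.rpc.address: master.abc.xyz.com\n"
    worker = "jobmanager.rpc.address: master.abc.xzy.com\n"  # transposed letters
    print(rpc_address_matches(master, worker))  # False: the two values differ
```

The parser is deliberately naive (it does not handle trailing inline comments or quoted values), so treat a mismatch as a hint to look at the files rather than as proof.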


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

the result is attached with the mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suspected that the JobManager does not find the TaskManager, and checked the TaskManager logs, which indeed show that the TaskManager is unable to register at the JobManager for quite a long time. The TaskManager log contains the messages org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable. Later, after a number of attempts, the TaskManager starts up and tries to register at the JobManager, which also fails after many attempts with the messages org.apache.flink.runtime.taskmanager.Taskmanager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.Taskmanager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123
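
The reasoning above hinges on the last part of the exception: "Number of instances=0" means no TaskManager has registered at all, which is a different problem from having too few slots. A hedged Python sketch that pulls those three numbers out of the message; `diagnose` and its wording are illustrative only:

```python
import re

# Matches the resource summary at the end of the NoResourceAvailableException text.
RESOURCES = re.compile(
    r"Number of instances=\s*(\d+), total number of slots=\s*(\d+), available slots=\s*(\d+)"
)


def diagnose(message):
    """Classify the exception by the scheduler's resource summary."""
    match = RESOURCES.search(message)
    if match is None:
        return "no resource summary in message"
    instances, total, available = (int(g) for g in match.groups())
    if instances == 0:
        # Nothing registered: look at connectivity, not at parallelism.
        return "no TaskManager registered with the JobManager"
    return f"{instances} TaskManager(s) registered, {available} of {total} slot(s) free"


if __name__ == "__main__":
    text = ("Resources available to scheduler: Number of instances=0, "
            "total number of slots=0, available slots=0")
    print(diagnose(text))
```

With zero instances, lowering the parallelism or raising the slot count (as the exception text suggests) would not change anything; the TaskManager has to reach the JobManager first.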

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that the TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. But even after waiting that long I get the same error.

Another suggestion was to run the cluster in streaming mode. So I tried it with the command : bin/start-cluster-streaming.sh and ran the job but I get the same error. I have rechecked all the configurations but I'm unable to find out the fault.

I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master shows SYN_SENT state (checked with the netstat -na and lsof -i commands).

I opened the webpage on master http://localhost:8081 but it shows nothing and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur














Re: TaskManager unable to register with JobManager

Ravinder Kaur
In reply to this post by Fabian Hueske-2
Hello Fabian,

Thank you very much for the resource. I had already gone through this and found port '6123' as the default for taskmanager registration. But I want to know the specific range of ports the taskmanager accesses during job execution.

The taskmanager always tries to access a random port during job execution, for which I need to open that port in the firewall using 'ufw allow port' during the execution; otherwise the job hangs and finally fails. So I wanted to know a particular range of ports which I can specify in iptables to always allow access.


Kind Regards,
Ravinder Kaur
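
The logs quoted in this thread already show which ports a given run picked: the TaskManager's actor system ("listening on addresses :[akka.tcp://flink@slave-IP:33813]"), its data port ("dataPort=49030"), and the JobManager's BLOB server ("Started BLOB server at 0.0.0.0:43683"). As a stopgap while looking for the right configuration settings, a Python sketch like the following can read them out of a log; the patterns are written against the log lines in this thread only, and `ports_in_log` is a hypothetical helper:

```python
import re

# Patterns modelled on the Flink 0.10.1 log lines quoted in this thread.
PORT_PATTERNS = {
    "actor system": re.compile(
        r"listening on addresses :\[akka\.tcp://flink@[^:\]]+:(\d+)\]"
    ),
    "data port": re.compile(r"dataPort=(\d+)"),
    "BLOB server": re.compile(r"Started BLOB server at \S+:(\d+)"),
}


def ports_in_log(lines):
    """Return {label: port} for every known port announcement found in lines."""
    found = {}
    for line in lines:
        for label, pattern in PORT_PATTERNS.items():
            match = pattern.search(line)
            if match:
                found[label] = int(match.group(1))
    return found


if __name__ == "__main__":
    sample = [
        "Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]",
        "TaskManager data connection information: hostname-of-master (dataPort=49030)",
    ]
    print(ports_in_log(sample))
```

The TaskManager log also shows the actor system being started at "slave_IP:0", which suggests these ports are chosen afresh on every start, so a firewall rule derived this way only holds until the next restart.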

On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <[hidden email]> wrote:
Hi Ravinder,

please have a look at the configuration documentation:

--> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager

Best, Fabian

2016-02-10 13:55 GMT+01:00 Ravinder Kaur <[hidden email]>:
Hello All,

I need to know the range of ports that are being used during the master/slave communication in the Flink cluster. Also is there a way I can specify a range of ports, at the slaves, to restrict them to connect to master only in this range?

Kind Regards,
Ravinder Kaur


On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
Can machines connect to port 6123? The firewall may block that port, but permit SSH.
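
One way to answer this from the worker without extra tools is a plain TCP connect. A minimal Python sketch; `can_connect` is a hypothetical helper and the 3-second default timeout is an arbitrary choice:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hostnames.
        return False
```

Run on the worker, `can_connect("hostname-of-master", 6123)` (with the real hostname substituted) returning False while SSH to the same machine works would be consistent with a firewall dropping traffic to that port.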

On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Here is the log file of the JobManager. I did not see anything suspicious, and as it suggests, the ports are also listening.

20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY
20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Security is not enabled. Starting non-authenticated JobManager.
20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager
20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor system at 10.155.208.138:6123
20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:48,295 INFO  Remoting                                                      - Starting remoting
20:58:48,541 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123]
20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManger web frontend
20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend listening at 0:0:0:0:0:0:0:0:8081
20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager actor
20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started memory archivist akka://flink/user/archive
20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None.
20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081
20:58:49,083 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016).
20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20:59:22,886 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
20:59:22,887 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)


On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
Hi,

the TaskManager is starting up, but it's not able to register at the job manager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thank you for pointing it out. I had a small typo in the hostname I edited in flink-conf.yaml. I've fixed it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more

The log of TaskManager again has the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils                  - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
20:58:59,842 INFO  Remoting                                                      - Starting remoting
20:59:00,094 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache                  - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
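
The registration timeouts in the log above double on every attempt (500, 1000, 2000, 4000, 8000 milliseconds). A minimal sketch of such a schedule follows. The 30-second cap is only inferred from the "attempt:92, timeout:30seconds" message quoted in the first mail of this thread, and the function name and parameters are hypothetical, not Flink's actual implementation.

```python
def registration_timeouts(initial_ms=500, max_ms=30_000, attempts=8):
    """Yield (attempt, timeout_ms) pairs; the timeout doubles until it hits max_ms.

    initial_ms and max_ms are assumptions read off the log lines in this thread.
    """
    timeout = initial_ms
    for attempt in range(1, attempts + 1):
        yield attempt, timeout
        timeout = min(timeout * 2, max_ms)


for attempt, timeout_ms in registration_timeouts():
    print(f"attempt {attempt}, timeout: {timeout_ms} milliseconds")
```

With a schedule like this, a high attempt number such as 92 means the TaskManager has been retrying at the capped interval for a long time, which points to a persistent connectivity problem rather than a slow start.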


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
This looks like the reason:

java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
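
One way to confirm this on the worker is to ask the local resolver for the name configured as jobmanager.rpc.address. The sketch below is a hypothetical helper using only Python's standard socket module; it is not part of Flink, and 'hostname-of-master' is the anonymized placeholder from the log, to be replaced with the real value.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses the local resolver reports for hostname.

    An empty list means the name cannot be resolved on this machine.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the IP address as a string.
    return sorted({info[4][0] for info in infos})


# Demonstration with a name that is normally defined in /etc/hosts.
print(resolve("localhost"))
```

Running resolve("hostname-of-master") on the worker and getting an empty list would match the UnknownHostException above; in that case an /etc/hosts entry or the fully qualified name is the usual fix.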

On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

The log file of the Taskmanager now shows the following 

18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current user: flink
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 491 MiBytes
18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Hadoop version: 2.7.0
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms512M
18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx512M
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxPermSize=256m
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink/flink-0.10.1/conf
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --streamingMode
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     batch
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 4096
18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager              - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
        at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
        at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
        at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
        at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
Hello,

Thanks for the quick reply. I tried to set jobmanager.rpc.address in flink-conf.yaml to the hostname of master node on both the nodes.

Now it does not start the Taskmanager at the worker node at all. When I start the cluster using ./bin/start-cluster.sh on master it shows the normal output of starting the Jobmanager and Taskmanager but when I run jps on the nodes the slave does not have the Taskmanager running. 

Running the WordCount example again fails showing the same error. Stopping the cluster says no taskmanager to stop.

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
Looks like the network configuration is not correct.

I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.

Greetings,
Stephan


On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:

Hello Community,

I'm a student and new to Apache Flink. I'm trying to learn and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.

Cluster: consists of 2 vms (one master and one worker)


When I start the cluster both the JobManager and the TaskManager are started on the master and worker respectively.

Command to start the cluster : bin/start-cluster.sh

JPS shows all the processes running.

Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar

the result is attached with the mail.

The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suspect that the JobManager does not find the TaskManager. The TaskManager log indeed shows that it is unable to register at the JobManager for quite a long time. The log contains the messages org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable. After a number of attempts the TaskManager starts up and tries to register at the JobManager, which also fails after many attempts with the messages org.apache.flink.runtime.taskmanager.Taskmanager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.Taskmanager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

and https://issues.apache.org/jira/browse/FLINK-1119 these links helpful. Stephan Ewen, who provided the solution in both links, explains that the TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long, I get the same error.

Another suggestion was to run the cluster in streaming mode. So I tried it with the command : bin/start-cluster-streaming.sh and ran the job but I get the same error. I have rechecked all the configurations but I'm unable to find out the fault.

I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked with the netstat -na and lsof -i commands).
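
A connection that stays in SYN_SENT while the master is listening usually means the SYN packets never get an answer, for example because a firewall drops them. A minimal sketch for testing this from the worker follows; can_connect is a hypothetical helper built on Python's standard socket module, and the host and port to pass in ('master', 6123) are the values from this thread.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port completes within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False


# Example call from the worker (not executed here):
#   can_connect("master", 6123)
```

If this returns False from the worker but True when run on the master itself, the JobManager is up and something between the two machines is blocking the port.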

I opened the webpage on master http://localhost:8081 but it shows nothing and localhost:8080 says connection refused.

Kindly help me out as it is very important for me. Let me know if you have any questions.

Kind Regards,
Ravinder Kaur

Re: TaskManager unable to register with JobManager

Ufuk Celebi
Hey Ravinder,

check out the following config keys:

blob.server.port
taskmanager.rpc.port
taskmanager.data.port


– Ufuk
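
To see which of these keys are pinned in a given flink-conf.yaml, a short script can scan the file for them. The sketch below is a hypothetical helper, not a Flink tool; it assumes the simple "key: value" line format of flink-conf.yaml, and the port numbers in the sample are purely illustrative.

```python
PORT_KEYS = ("blob.server.port", "taskmanager.rpc.port", "taskmanager.data.port")


def read_port_settings(conf_text):
    """Map each key in PORT_KEYS to its configured value, or None if it is absent."""
    settings = dict.fromkeys(PORT_KEYS)
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip()
        if key in settings:
            settings[key] = value.strip()
    return settings


# Illustrative configuration; the values are made up for this example.
sample = """
jobmanager.rpc.address: hostname-of-master
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
taskmanager.data.port: 6121
"""

for key, value in read_port_settings(sample).items():
    print(key, "->", value if value is not None else "not set")
```

Any key reported as "not set" falls back to the default of the installed Flink version, which should be checked against the configuration documentation before writing firewall rules.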


On Wed, Feb 10, 2016 at 4:06 PM, Ravinder Kaur <[hidden email]> wrote:

> Hello Fabian,
>
> Thank you very much for the resource. I had already gone through this and
> have found port '6123' as default for taskmanager registration. But I want
> to know the specific range of ports the taskmanager accesses during job
> execution.
>
> The taskmanager always tries to access a random port during job execution,
> so I have to open the firewall with 'ufw allow port' during the
> execution; otherwise the job hangs and finally fails. So I wanted to know a
> particular range of ports which I can specify in iptables to always
> allow access.
>
>
> Kind Regards,
> Ravinder Kaur
>
> On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <[hidden email]> wrote:
>>
>> Hi Ravinder,
>>
>> please have a look at the configuration documentation:
>>
>> -->
>> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager
>>
>> Best, Fabian
>>
>> 2016-02-10 13:55 GMT+01:00 Ravinder Kaur <[hidden email]>:
>>>
>>> Hello All,
>>>
>>> I need to know the range of ports that are being used during the
>>> master/slave communication in the Flink cluster. Also is there a way I can
>>> specify a range of ports, at the slaves, to restrict them to connect to
>>> master only in this range?
>>>
>>> Kind Regards,
>>> Ravinder Kaur
>>>
>>>
>>> On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
>>>>
>>>> Can machines connect to port 6123? The firewall may block that port, but
>>>> permit SSH.
>>>>
>>>> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]>
>>>> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> Here is the log file of Jobmanager. I did not see anything suspicious
>>>>> and as it suggests the ports are also listening.
>>>>>
>>>>> 20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and
>>>>> streaming mode BATCH_ONLY
>>>>> 20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Security is not enabled. Starting non-authenticated JobManager.
>>>>> 20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Starting JobManager
>>>>> 20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Starting JobManager actor system at 10.155.208.138:6123
>>>>> 20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger
>>>>> - Slf4jLogger started
>>>>> 20:58:48,295 INFO  Remoting
>>>>> - Starting remoting
>>>>> 20:58:48,541 INFO  Remoting
>>>>> - Remoting started; listening on addresses
>>>>> :[akka.tcp://flink@IP-of-master:6123]
>>>>> 20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Starting JobManger web frontend
>>>>> 20:58:48,690 INFO
>>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using
>>>>> directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web
>>>>> interface files
>>>>> 20:58:48,691 INFO
>>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job
>>>>> manager log from
>>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
>>>>> 20:58:48,691 INFO
>>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving job
>>>>> manager stdout from
>>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
>>>>> 20:58:49,044 INFO
>>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend
>>>>> listening at 0:0:0:0:0:0:0:0:8081
>>>>> 20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Starting JobManager actor
>>>>> 20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer
>>>>> - Created BLOB server storage directory
>>>>> /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
>>>>> 20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer
>>>>> - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max
>>>>> backlog: 1000
>>>>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
>>>>> - Started memory archivist akka://flink/user/archive
>>>>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager.
>>>>> 20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted
>>>>> leadership with leader session ID None.
>>>>> 20:58:49,082 INFO
>>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting
>>>>> with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port
>>>>> 8081
>>>>> 20:58:49,083 INFO
>>>>> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader
>>>>> reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
>>>>> 20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb
>>>>> 03 20:59:22 CET 2016).
>>>>> 20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb
>>>>> 03 20:59:22 CET 2016).
>>>>> 20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb
>>>>> 03 20:59:22 CET 2016) changed to RUNNING.
>>>>> 20:59:22,859 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN
>>>>> DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
>>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at
>>>>> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)
>>>>> (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
>>>>> 20:59:22,881 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN
>>>>> DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
>>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at
>>>>> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)
>>>>> (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
>>>>> 20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb
>>>>> 03 20:59:22 CET 2016) changed to FAILING.
>>>>>
>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>> Not enough free slots available to run the job. You can decrease the
>>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at
>>>>> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>>> Number of instances=0, total number of slots=0, available slots=0
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>         at
>>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>>>         at
>>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>>>         at
>>>>> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>>>>>         at
>>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>         at
>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> 20:59:22,886 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN Reduce
>>>>> (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1)
>>>>> (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
>>>>> 20:59:22,887 INFO
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink
>>>>> (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from
>>>>> CREATED to CANCELED
>>>>> 20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>> - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb
>>>>> 03 20:59:22 CET 2016) changed to FAILED.
>>>>>
>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>> Not enough free slots available to run the job. You can decrease the
>>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at
>>>>> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>>> Number of instances=0, total number of slots=0, available slots=0
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>         at
>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>         at
>>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>>>         at
>>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> the TaskManager is starting up, but it's not able to register at the
>>>>>> job manager. Did you check the JobManager log? Do you see anything
>>>>>> suspicious there? Are the ports matching?
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Thank you for pointing it out. I had a little typo while I edited the
>>>>>>> hostname in flink-conf.yaml. I've reset it and the TaskManager started up.
>>>>>>> But I still can't run the WordCount example and it throws the same
>>>>>>> NoResourceAvailableException.
>>>>>>>
>>>>>>> Caused by:
>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>>> Not enough free slots available to run the job. You can decrease the
>>>>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at
>>>>>>> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>>>>> Number of instances=0, total number of slots=0, available slots=0
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>>>         at
>>>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>>>         ... 8 more
>>>>>>>
>>>>>>> The log of TaskManager again has the same errors as before.
>>>>>>>
>>>>>>> 20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>> - Failed to connect from address '/slave-IP': connect timed out
>>>>>>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>> - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is
>>>>>>> unreachable
>>>>>>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>> - Failed to connect from address '/127.0.0.1': Invalid argument
>>>>>>> 20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>> - Could not connect to /master-IP:6123. Selecting a local address using
>>>>>>> heuristics.
>>>>>>> 20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for
>>>>>>> communication.
>>>>>>> 20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Starting TaskManager in streaming mode BATCH_ONLY
>>>>>>> 20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Starting TaskManager actor system at slave_IP:0
>>>>>>> 20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger
>>>>>>> - Slf4jLogger started
>>>>>>> 20:58:59,842 INFO  Remoting
>>>>>>> - Starting remoting
>>>>>>> 20:59:00,094 INFO  Remoting
>>>>>>> - Remoting started; listening on addresses
>>>>>>> :[akka.tcp://flink@slave-IP:33813]
>>>>>>> 20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Starting TaskManager actor
>>>>>>> 20:59:00,125 INFO
>>>>>>> org.apache.flink.runtime.io.network.netty.NettyConfig         - NettyConfig
>>>>>>> [server address: hostname-of-master/master-IP, server port: 49030, memory
>>>>>>> segment size (bytes): 32768, transport type: NIO, number of server threads:
>>>>>>> 0 (use Netty's default), number of client threads: 0 (use Netty's default),
>>>>>>> server connect backlog: 0 (use Netty's default), client connect timeout
>>>>>>> (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>>>>>> 20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Messages between TaskManager and JobManager have a max timeout of 100000
>>>>>>> milliseconds
>>>>>>> 20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
>>>>>>> 20:59:00,210 INFO
>>>>>>> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated 64
>>>>>>> MB for network buffer pool (number of memory segments: 2048, bytes per
>>>>>>> segment: 32768).
>>>>>>> 20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Using 0.7 of the currently free heap space for Flink managed heap memory
>>>>>>> (293 MB).
>>>>>>> 20:59:00,565 INFO
>>>>>>> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
>>>>>>> uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill
>>>>>>> files.
>>>>>>> 20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache
>>>>>>> - User file cache uses directory
>>>>>>> /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
>>>>>>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
>>>>>>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - TaskManager data connection information: hostname-of-master
>>>>>>> (dataPort=49030)
>>>>>>> 20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - TaskManager has 1 task slot(s).
>>>>>>> 20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB
>>>>>>> (used/committed/max)]
>>>>>>> 20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Trying to register at JobManager
>>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500
>>>>>>> milliseconds)
>>>>>>> 20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Trying to register at JobManager
>>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000
>>>>>>> milliseconds)
>>>>>>> 20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Trying to register at JobManager
>>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000
>>>>>>> milliseconds)
>>>>>>> 20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Trying to register at JobManager
>>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000
>>>>>>> milliseconds)
>>>>>>> 20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>> - Trying to register at JobManager
>>>>>>> akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000
>>>>>>> milliseconds)
>>>>>>>
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Ravinder Kaur
>>>>>>>
>>>>>>> On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> This looks like the reason:
>>>>>>>>
>>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager
>>>>>>>> hostname 'hostname-of-master' specified in the configuration
>>>>>>>>
>>>>>>>> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> The log file of the Taskmanager now shows the following
>>>>>>>>>
>>>>>>>>> 18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader
>>>>>>>>> - Unable to load native-hadoop library for your platform... using
>>>>>>>>> builtin-java classes where applicable
>>>>>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @
>>>>>>>>> 12:41:12 CET)
>>>>>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  Current user: flink
>>>>>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
>>>>>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  Maximum heap size: 491 MiBytes
>>>>>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  Hadoop version: 2.7.0
>>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  JVM Options:
>>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     -Xms512M
>>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     -Xmx512M
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     -XX:MaxDirectMemorySize=8388607T
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     -XX:MaxPermSize=256m
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -
>>>>>>>>> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -
>>>>>>>>> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -
>>>>>>>>> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  Program Arguments:
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     --configDir
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     /home/flink/flink-0.10.1/conf
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     --streamingMode
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -     batch
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -  Classpath:
>>>>>>>>> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> -
>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>> 18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> - Maximum number of open file descriptors is 4096
>>>>>>>>> 18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> - Loading configuration from /home/flink/flink-0.10.1/conf
>>>>>>>>> 18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> - Security is not enabled. Starting non-authenticated TaskManager.
>>>>>>>>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>> - Failed to run TaskManager.
>>>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager
>>>>>>>>> hostname 'hostname-of-master' specified in the configuration
>>>>>>>>>         at
>>>>>>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>>>>>>>>         at
>>>>>>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>>>>>>>>         at
>>>>>>>>> org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>>>>>>>>         at
>>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>>>>>>>>         at
>>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>>>>>>>>         at
>>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>>>>>>>>         at
>>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Ravinder Kaur
>>>>>>>>>
>>>>>>>>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> What do the TaskManager logs say?
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur
>>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the quick reply. I tried to set jobmanager.rpc.address
>>>>>>>>>>> in flink-conf.yaml to the hostname of master node on both the nodes.
>>>>>>>>>>>
>>>>>>>>>>> Now it does not start the Taskmanager at the worker node at all.
>>>>>>>>>>> When I start the cluster using ./bin/start-cluster.sh on master it shows the
>>>>>>>>>>> normal output of starting the Jobmanager and Taskmanager but when I run jps
>>>>>>>>>>> on the nodes the slave does not have the Taskmanager running.
>>>>>>>>>>>
>>>>>>>>>>> Running the WordCount example again fails showing the same error.
>>>>>>>>>>> Stopping the cluster says no taskmanager to stop.
>>>>>>>>>>>
>>>>>>>>>>> Kind Regards,
>>>>>>>>>>> Ravinder Kaur
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like the network configuration is not correct.
>>>>>>>>>>>>
>>>>>>>>>>>> I would try setting the full host name (like
>>>>>>>>>>>> "master.abc.xyz.com") as jobmanager.rpc.address.
>>>>>>>>>>>>
>>>>>>>>>>>> Greetings,
>>>>>>>>>>>> Stephan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur
>>>>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Community,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm a student and new to Apache Flink. I'm trying to learn and
>>>>>>>>>>>>> have set up a 2-node standalone Flink (0.10.1) cluster (one master and one
>>>>>>>>>>>>> worker). I'm facing the following issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cluster: consists of 2 vms (one master and one worker)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The configurations are done as per
>>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I start the cluster both the JobManager and the
>>>>>>>>>>>>> TaskManager are started on the master and worker respectively.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Command to start the cluster : bin/start-cluster.sh
>>>>>>>>>>>>>
>>>>>>>>>>>>> JPS shows all the processes running.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I run the following command to run a WordCount example
>>>>>>>>>>>>> job: ./bin/flink run ./examples/WordCount.jar
>>>>>>>>>>>>>
>>>>>>>>>>>>> the result is attached with the mail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The error is
>>>>>>>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>>>>>>>>> Not enough free slots available to run the job
>>>>>>>>>>>>> ....................... Resources available to scheduler: Number of
>>>>>>>>>>>>> instances=0, total number of slots=0, available slots=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore I suppose that the JobManager does not find the
>>>>>>>>>>>>> TaskManager. I checked the logs of the TaskManager, which indeed show that
>>>>>>>>>>>>> the TaskManager is unable to register at the JobManager for quite a long
>>>>>>>>>>>>> time. There are org.apache.flink.runtime.net.ConnectionUtils: Failed to
>>>>>>>>>>>>> connect from localhost: Connect timed out and
>>>>>>>>>>>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address
>>>>>>>>>>>>> localhost: Network is Unreachable messages in the log of the TaskManager.
>>>>>>>>>>>>> Later, when it starts up after a number of attempts, it tries to register at
>>>>>>>>>>>>> the JobManager, which also fails after a lot of attempts, showing the
>>>>>>>>>>>>> following messages: org.apache.flink.runtime.taskmanager.TaskManager: Trying
>>>>>>>>>>>>> to register at JobManager akka.tcp://flink@master:6123/user/jobmanager
>>>>>>>>>>>>> (attempt:92, timeout:30seconds) and
>>>>>>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with
>>>>>>>>>>>>> unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager].
>>>>>>>>>>>>> Address is now gated for 5000ms, all messages to this address will be
>>>>>>>>>>>>> delivered to dead letters. Reason: Connection timed out: /master:6123
>>>>>>>>>>>>>
>>>>>>>>>>>>> I searched the internet for these messages and found the links
>>>>>>>>>>>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>>>>>>>>>>>> and https://issues.apache.org/jira/browse/FLINK-1119
>>>>>>>>>>>>> helpful. Stephan Ewen, who provided the solution in both
>>>>>>>>>>>>> links, explains that the TaskManagers can take quite some time to
>>>>>>>>>>>>> register at the JobManager, so I waited for as long as 20 minutes
>>>>>>>>>>>>> after starting the cluster before running the job. But even after waiting
>>>>>>>>>>>>> that long I get the same error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another suggestion was to run the cluster in streaming mode. So
>>>>>>>>>>>>> I tried it with the command bin/start-cluster-streaming.sh and ran the job,
>>>>>>>>>>>>> but I get the same error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I re-checked all the configurations but could not find anything
>>>>>>>>>>>>> wrong. I also checked port 6123 on the master, which is in LISTEN state,
>>>>>>>>>>>>> while the TCP request from worker to master shows SYN_SENT state (checked
>>>>>>>>>>>>> using the netstat -na and lsof -i commands).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I opened the webpage on master http://localhost:8081 but it
>>>>>>>>>>>>> shows nothing and localhost:8080 says connection refused.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kindly help me out as it is very important for me. Let me know
>>>>>>>>>>>>> if you have any questions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>> Ravinder Kaur
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
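
The two failures reported in this thread (the UnknownHostException for the JobManager hostname, and the connect timeouts to port 6123) can be checked from the worker node without starting Flink. What follows is only a minimal Python sketch and not part of Flink: the function name check_jobmanager is made up for illustration, and the default port 6123 is taken from the log lines quoted above. It resolves the hostname first and then attempts a plain TCP connection.

```python
import socket


def check_jobmanager(host, port=6123, timeout=3.0):
    """Return (resolved_ip_or_None, reachable, detail) for host:port.

    Hypothetical helper for illustration only; it is not a Flink API.
    """
    # Step 1: name resolution. A failure here corresponds to the
    # UnknownHostException seen in the TaskManager log.
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return None, False, "cannot resolve %r: %s" % (host, exc)

    # Step 2: plain TCP connect. A timeout or refusal here corresponds to
    # the "connect timed out" / unreachable-host messages.
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True, "TCP connect to %s:%d succeeded" % (ip, port)
    except OSError as exc:
        return ip, False, "TCP connect to %s:%d failed: %s" % (ip, port, exc)
```

On the worker, one would call check_jobmanager with the value set as jobmanager.rpc.address in flink-conf.yaml, for example check_jobmanager("hostname-of-master"). A resolution failure suggests looking at DNS or /etc/hosts on the worker. A timeout while the master shows the port in LISTEN state may indicate a firewall between the VMs or a JobManager bound to a different interface, though that is an assumption to verify rather than a diagnosis.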