Hello Community,
I'm a student and new to Apache Flink. To learn it, I have set up a two-node standalone Flink (0.10.1) cluster on 2 VMs (one master and one worker), configured as per https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html, and I'm facing the following issue.

When I start the cluster with bin/start-cluster.sh, the JobManager and the TaskManager are started on the master and the worker respectively, and jps shows all the processes running. Then I run the WordCount example job with ./bin/flink run ./examples/WordCount.jar; the result is attached to this mail. The error is:

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0

I therefore suppose that the JobManager does not find the TaskManager. I checked the TaskManager logs, which indeed show that the TaskManager is unable to register at the JobManager for quite a long time. The TaskManager log contains these messages:

org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out
org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable

Later, when it starts up after a number of attempts, it tries to register at the JobManager, which also fails after many attempts with the following messages:

org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt 92, timeout: 30 seconds)
org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123

I searched the internet for these messages and found http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb and https://issues.apache.org/jira/browse/FLINK-1119 helpful. Stephan Ewen, who provided the solution in both links, explains that TaskManagers can take quite some time to register at the JobManager, so I waited as long as 20 minutes after starting the cluster before running the job. Even after waiting that long I get the same error. Another suggestion was to run the cluster in streaming mode, so I started it with bin/start-cluster-streaming.sh and ran the job, but I get the same error.

I have rechecked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, while the TCP request from the worker to the master stays in SYN_SENT state (checked with netstat -na and lsof -i). I opened the web page http://localhost:8081 on the master but it shows nothing, and localhost:8080 says connection refused.

Kindly help me out, as it is very important for me. Let me know if you have any questions.

Kind Regards, Ravinder Kaur
|
Looks like the network configuration is not correct. I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address. Greetings, Stephan On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:
|
Hello, Thanks for the quick reply. I tried to set jobmanager.rpc.address in flink-conf.yaml to the hostname of master node on both the nodes. Now it does not start the Taskmanager at the worker node at all. When I start the cluster using ./bin/start-cluster.sh on master it shows the normal output of starting the Jobmanager and Taskmanager but when I run jps on the nodes the slave does not have the Taskmanager running. Running the WordCount example again fails showing the same error. Stopping the cluster says no taskmanager to stop. Kind Regards, Ravinder Kaur On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
|
What do the TaskManager logs say? On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
|
Hello, The log file of the Taskmanager now shows the following:

18:27:10,082 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current user: flink
18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 491 MiBytes
18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.7.0
18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms512M
18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx512M
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxPermSize=256m
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments:
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - /home/flink/flink-0.10.1/conf
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --streamingMode
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - batch
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
18:27:10,252 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 4096
18:27:10,277 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /home/flink/flink-0.10.1/conf
18:27:10,356 INFO org.apache.flink.runtime.taskmanager.TaskManager - Security is not enabled. Starting non-authenticated TaskManager.
18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager - Failed to run TaskManager.
java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
    at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
    at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
    at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
    at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
    at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
    at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)

Kind Regards, Ravinder Kaur On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
|
This looks like the reason: java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
|
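The UnknownHostException above means the worker could not turn the configured JobManager hostname into an IP address. A minimal Python sketch of the same check, run on the worker, is shown here. It asks the operating system's resolver, which is usually, but not necessarily, what the JVM sees; the function name is hypothetical and 'hostname-of-master' is the placeholder used in the thread.

```python
import socket


def resolve(hostname):
    """Return the sorted addresses the local resolver gives for hostname, or [] if it cannot be resolved."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Same situation as the UnknownHostException in the TaskManager log.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Calling resolve("hostname-of-master") on the worker with the value from jobmanager.rpc.address should return a non-empty list containing the master's IP; an empty list points to a typo in the hostname or a missing /etc/hosts or DNS entry.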
Hello, Thank you for pointing it out. I had a small typo from when I edited the hostname in flink-conf.yaml. I've fixed it and the TaskManager started up. But I still can't run the WordCount example; it throws the same NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
    at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
    at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
    at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
    at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
    at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
    ... 8 more

The log of the TaskManager again has the same errors as before.

20:58:58,457 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
20:58:58,458 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN org.apache.flink.runtime.net.ConnectionUtils - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
20:58:59,050 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
20:58:59,051 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
20:58:59,842 INFO Remoting - Starting remoting
20:59:00,094 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
20:59:00,100 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
20:59:00,125 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
20:59:00,131 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
20:59:00,142 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
20:59:00,210 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
20:59:00,323 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
20:59:00,565 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: hostname-of-master (dataPort=49030)
20:59:00,909 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 1 task slot(s).
20:59:00,910 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
20:59:00,917 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)

Kind Regards, Ravinder Kaur On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
|
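The registration attempts in the TaskManager log use a timeout that doubles with each attempt (500, 1000, 2000, 4000, 8000 milliseconds). A minimal Python sketch of such a schedule follows. The 30-second cap is an assumption, taken from the "attempt 92, timeout 30 seconds" message quoted in the first post; the function name and parameters are hypothetical, and this is an illustration inferred from the log lines, not Flink's actual code.

```python
def registration_timeouts(attempts, initial_ms=500, cap_ms=30_000):
    """Timeouts in milliseconds for attempts 1..attempts: doubling from initial_ms, never above cap_ms."""
    timeouts = []
    current = initial_ms
    for _ in range(attempts):
        timeouts.append(current)
        # Double for the next attempt, but never exceed the (assumed) cap.
        current = min(current * 2, cap_ms)
    return timeouts


# First five attempts, matching the log lines above:
# registration_timeouts(5) -> [500, 1000, 2000, 4000, 8000]
```

Under these assumptions the timeout stops growing after a handful of attempts, so a TaskManager that is still retrying at attempt 92 has been blocked for a long time, which fits a network problem better than a slow start.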
Hi, the TaskManager is starting up, but it's not able to register at the JobManager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching? On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
|
There still seems to be something wrong with your network config. This does not look like a Flink problem and needs work on your end; we cannot debug that for you. Please go through your network setup and check, for example:
- whether the hostnames are right (is "master-IP" really the name of the network interface on the JobManager?)
- whether the machines can actually communicate with each other (firewall, etc.)
- whether the "master-IP" interface is externally visible (such that other machines can connect to it)
These things are prerequisites for any distributed system installation. On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
|
Hello, I also feel that it has something to do with the network configuration. But I have already checked all the prerequisites that you mentioned.
Is there anything else that you can suggest? Kind Regards, Ravinder On Wed, Feb 3, 2016 at 9:36 PM, Stephan Ewen <[hidden email]> wrote:
|
In reply to this post by rmetzger0
Hello, Here is the log file of Jobmanager. I did not see some thing suspicious and as it suggests the ports are also listening. 20:58:46,906 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and streaming mode BATCH_ONLY 20:58:46,978 INFO org.apache.flink.runtime.jobmanager.JobManager - Security is not enabled. Starting non-authenticated JobManager. 20:58:46,979 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager 20:58:46,980 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system at 10.155.208.138:6123 20:58:48,196 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 20:58:48,295 INFO Remoting - Starting remoting 20:58:48,541 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@IP-of-master:6123] 20:58:48,549 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManger web frontend 20:58:48,690 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface files 20:58:48,691 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving job manager log from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log 20:58:48,691 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving job manager stdout from /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out 20:58:49,044 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081 20:58:49,045 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor 20:58:49,052 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258 20:58:49,054 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max backlog: 1000 20:58:49,075 
INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive 20:58:49,075 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager. 20:58:49,081 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted leadership with leader session ID None. 20:58:49,082 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port 8081 20:58:49,083 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null. 20:59:22,794 INFO org.apache.flink.runtime.jobmanager.JobManager - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016). 20:59:22,853 INFO org.apache.flink.runtime.jobmanager.JobManager - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016). 20:59:22,857 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING. 
20:59:22,859 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED 20:59:22,881 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED 20:59:22,881 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING. org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. 
Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0 at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256) at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131) at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298) at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458) at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322) at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 20:59:22,886 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from 
CREATED to CANCELED 20:59:22,887 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED 20:59:22,890 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED. org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. 
Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0 at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256) at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131) at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298) at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458) at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322) at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
|
Can machines connect to port 6123? The firewall may block that port, but permit SSH. On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> wrote:
|
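The question above (can the worker open a TCP connection to port 6123 on the master?) can be checked from the worker with a few lines of Python. This is a minimal sketch: the function name and the 3-second timeout are hypothetical choices, and "master" stands in for whatever jobmanager.rpc.address is set to.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. a firewall dropping SYNs)
        # and hostnames that cannot be resolved.
        return False
```

Running can_connect("master", 6123) on the worker returns False when a firewall silently drops the packets, which matches the SYN_SENT state seen with netstat, while SSH on port 22 may still work.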
Hello, Thank you very much. This was indeed the problem. The firewall was blocking ports 6123 and 43008, and my user did not have permission to change the firewall rules. I reran the command 'ufw allow port' with root privileges, and this made the job run. Kind Regards, Ravinder Kaur On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
|
In reply to this post by Stephan Ewen
Hello All, I need to know the range of ports that are being used during the master/slave communication in the Flink cluster. Also is there a way I can specify a range of ports, at the slaves, to restrict them to connect to master only in this range? Kind Regards, Ravinder Kaur On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote:
|
Hi Ravinder, please have a look at the configuration documentation: --> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager 2016-02-10 13:55 GMT+01:00 Ravinder Kaur <[hidden email]>:
|
Note that some of these config options are only available starting from version 1.0-SNAPSHOT On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <[hidden email]> wrote:
|
In reply to this post by Fabian Hueske-2
Hello Fabian, Thank you very much for the resource. I had already gone through it and found that port 6123 is the default for TaskManager registration. But I want to know the specific range of ports the TaskManager accesses during job execution. The TaskManager always tries to access a random port during job execution, which I have to open in the firewall using 'ufw allow port' during the execution; otherwise the job hangs and finally fails. So I would like to know a particular range of ports that I can specify in iptables to always allow access. Kind Regards, Ravinder Kaur On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <[hidden email]> wrote:
|
Hey Ravinder,
check out the following config keys: blob.server.port taskmanager.rpc.port taskmanager.data.port – Ufuk On Wed, Feb 10, 2016 at 4:06 PM, Ravinder Kaur <[hidden email]> wrote: > Hello Fabian, > > Thank you very much for the resource. I had already gone through this and > have found port '6123' as default for taskmanager registration. But I want > to know the specific range of ports the taskmanager access during job > execution. > > The taskmanager always tries to access a random port during job execution > for which I need to disable firewall using 'ufw allow port' during the > execution, otherwise the job hangs and finally fails. So I wanted to know a > particular range of ports which I can specify in the iptables to always > allow access. > > > Kind Regards, > Ravinder Kaur > > On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <[hidden email]> wrote: >> >> Hi Ravinder, >> >> please have a look at the configuration documentation: >> >> --> >> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager >> >> Best, Fabian >> >> 2016-02-10 13:55 GMT+01:00 Ravinder Kaur <[hidden email]>: >>> >>> Hello All, >>> >>> I need to know the range of ports that are being used during the >>> master/slave communication in the Flink cluster. Also is there a way I can >>> specify a range of ports, at the slaves, to restrict them to connect to >>> master only in this range? >>> >>> Kind Regards, >>> Ravinder Kaur >>> >>> >>> On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <[hidden email]> wrote: >>>> >>>> Can machines connect to port 6123? The firewall may block that port, put >>>> permit SSH. >>>> >>>> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <[hidden email]> >>>> wrote: >>>>> >>>>> Hello, >>>>> >>>>> Here is the log file of Jobmanager. I did not see some thing suspicious >>>>> and as it suggests the ports are also listening. 
>>>>> >>>>> 20:58:46,906 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Starting JobManager on IP-of-master:6123 with execution mode CLUSTER and >>>>> streaming mode BATCH_ONLY >>>>> 20:58:46,978 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Security is not enabled. Starting non-authenticated JobManager. >>>>> 20:58:46,979 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Starting JobManager >>>>> 20:58:46,980 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Starting JobManager actor system at 10.155.208.138:6123 >>>>> 20:58:48,196 INFO akka.event.slf4j.Slf4jLogger >>>>> - Slf4jLogger started >>>>> 20:58:48,295 INFO Remoting >>>>> - Starting remoting >>>>> 20:58:48,541 INFO Remoting >>>>> - Remoting started; listening on addresses >>>>> :[akka.tcp://flink@IP-of-master:6123] >>>>> 20:58:48,549 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Starting JobManger web frontend >>>>> 20:58:48,690 INFO >>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using >>>>> directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web >>>>> interface files >>>>> 20:58:48,691 INFO >>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving job >>>>> manager log from >>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log >>>>> 20:58:48,691 INFO >>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving job >>>>> manager stdout from >>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out >>>>> 20:58:49,044 INFO >>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend >>>>> listening at 0:0:0:0:0:0:0:0:8081 >>>>> 20:58:49,045 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Starting JobManager actor >>>>> 20:58:49,052 INFO org.apache.flink.runtime.blob.BlobServer >>>>> - Created BLOB server storage directory >>>>> /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258 >>>>> 20:58:49,054 INFO 
org.apache.flink.runtime.blob.BlobServer >>>>> - Started BLOB server at 0.0.0.0:43683 - max concurrent requests: 50 - max >>>>> backlog: 1000 >>>>> 20:58:49,075 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist >>>>> - Started memory archivist akka://flink/user/archive >>>>> 20:58:49,075 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Starting JobManager at akka.tcp://flink@IP-of-master:6123/user/jobmanager. >>>>> 20:58:49,081 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager was granted >>>>> leadership with leader session ID None. >>>>> 20:58:49,082 INFO >>>>> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting >>>>> with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on port >>>>> 8081 >>>>> 20:58:49,083 INFO >>>>> org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader >>>>> reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null. >>>>> 20:59:22,794 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb >>>>> 03 20:59:22 CET 2016). >>>>> 20:59:22,853 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb >>>>> 03 20:59:22 CET 2016). >>>>> 20:59:22,857 INFO org.apache.flink.runtime.jobmanager.JobManager >>>>> - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb >>>>> 03 20:59:22 CET 2016) changed to RUNNING. 
>>>>> 20:59:22,859 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
>>>>> 20:59:22,881 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
>>>>> 20:59:22,881 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
>>>>>     at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>>     at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>>     at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>>>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>>>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>>>>>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>>>>>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> 20:59:22,886 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1) (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
>>>>> 20:59:22,887 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSink (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from CREATED to CANCELED
>>>>> 20:59:22,890 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
>>>>>     at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>>     at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>>     at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>>>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <[hidden email]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> the TaskManager is starting up, but it's not able to register at the job manager. Did you check the JobManager log? Do you see anything suspicious there? Are the ports matching?
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <[hidden email]> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Thank you for pointing it out.
>>>>>>> I had a little typo while I edited the hostname in flink-conf.yaml. I've reset it and the TaskManager started up. But I still can't run the WordCount example and it throws the same NoResourceAvailableException.
>>>>>>>
>>>>>>> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
>>>>>>>     at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>>>>     at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>>>>     at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>>>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>>>     ... 8 more
>>>>>>>
>>>>>>> The log of the TaskManager again has the same errors as before.
>>>>>>>
>>>>>>> 20:58:58,457 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/slave-IP': connect timed out
>>>>>>> 20:58:58,458 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is unreachable
>>>>>>> 20:58:58,458 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument
>>>>>>> 20:58:59,048 WARN org.apache.flink.runtime.net.ConnectionUtils - Could not connect to /master-IP:6123. Selecting a local address using heuristics.
>>>>>>> 20:58:59,050 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP) for communication.
>>>>>>> 20:58:59,051 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager in streaming mode BATCH_ONLY
>>>>>>> 20:58:59,052 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at slave_IP:0
>>>>>>> 20:58:59,776 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>>>>>>> 20:58:59,842 INFO Remoting - Starting remoting
>>>>>>> 20:59:00,094 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP:33813]
>>>>>>> 20:59:00,100 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
>>>>>>> 20:59:00,125 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: hostname-of-master/master-IP, server port: 49030, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 0 (use Netty's default), number of client threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>>>>>> 20:59:00,131 INFO org.apache.flink.runtime.taskmanager.TaskManager - Messages between TaskManager and JobManager have a max timeout of 100000 milliseconds
>>>>>>> 20:59:00,142 INFO org.apache.flink.runtime.taskmanager.TaskManager - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00% usable)
>>>>>>> 20:59:00,210 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 64 MB for network buffer pool (number of memory segments: 2048, bytes per segment: 32768).
>>>>>>> 20:59:00,323 INFO org.apache.flink.runtime.taskmanager.TaskManager - Using 0.7 of the currently free heap space for Flink managed heap memory (293 MB).
>>>>>>> 20:59:00,565 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
>>>>>>> 20:59:00,578 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
>>>>>>> 20:59:00,908 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#-157676733.
>>>>>>> 20:59:00,908 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: hostname-of-master (dataPort=49030)
>>>>>>> 20:59:00,909 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 1 task slot(s).
>>>>>>> 20:59:00,910 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB (used/committed/max)]
>>>>>>> 20:59:00,917 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>>>>>>> 20:59:01,443 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
>>>>>>> 20:59:02,873 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
>>>>>>> 20:59:04,893 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
>>>>>>> 20:59:08,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
>>>>>>>
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Ravinder Kaur
>>>>>>>
>>>>>>> On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> This looks like the reason:
>>>>>>>>
>>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
>>>>>>>>
>>>>>>>> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> The log file of the TaskManager now shows the following:
>>>>>>>>>
>>>>>>>>> 18:27:10,082 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>>>> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
>>>>>>>>> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 0.10.1, Rev:2e9b231, Date:22.11.2015 @ 12:41:12 CET)
>>>>>>>>> 18:27:10,244 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current user: flink
>>>>>>>>> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
>>>>>>>>> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 491 MiBytes
>>>>>>>>> 18:27:10,245 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - Hadoop version: 2.7.0
>>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
>>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms512M
>>>>>>>>> 18:27:10,247 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx512M
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxPermSize=256m
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments:
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - /home/flink/flink-0.10.1/conf
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --streamingMode
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - batch
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>>>>>>>>> 18:27:10,248 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
>>>>>>>>> 18:27:10,252 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 4096
>>>>>>>>> 18:27:10,277 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /home/flink/flink-0.10.1/conf
>>>>>>>>> 18:27:10,356 INFO org.apache.flink.runtime.taskmanager.TaskManager - Security is not enabled. Starting non-authenticated TaskManager.
>>>>>>>>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager - Failed to run TaskManager.
>>>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager hostname 'hostname-of-master' specified in the configuration
>>>>>>>>>     at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>>>>>>>>     at org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>>>>>>>>     at org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>>>>>>>>     at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>>>>>>>>     at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>>>>>>>>     at org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>>>>>>>>     at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>> Ravinder Kaur
>>>>>>>>>
>>>>>>>>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>> What do the TaskManager logs say?
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the quick reply. I tried setting jobmanager.rpc.address in flink-conf.yaml to the hostname of the master node on both nodes.
>>>>>>>>>>>
>>>>>>>>>>> Now the TaskManager does not start on the worker node at all. When I start the cluster using ./bin/start-cluster.sh on the master, it shows the normal output of starting the JobManager and TaskManager, but when I run jps on the nodes, the slave does not have the TaskManager running.
>>>>>>>>>>>
>>>>>>>>>>> Running the WordCount example again fails with the same error. Stopping the cluster says there is no TaskManager to stop.
>>>>>>>>>>>
>>>>>>>>>>> Kind Regards,
>>>>>>>>>>> Ravinder Kaur
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <[hidden email]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Looks like the network configuration is not correct.
>>>>>>>>>>>>
>>>>>>>>>>>> I would try setting the full host name (like "master.abc.xyz.com") as jobmanager.rpc.address.
>>>>>>>>>>>>
>>>>>>>>>>>> Greetings,
>>>>>>>>>>>> Stephan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <[hidden email]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Community,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm a student and new to Apache Flink. I'm trying to learn and have set up a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm facing the following issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cluster: consists of 2 VMs (one master and one worker)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The configurations are done as per https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> When I start the cluster, both the JobManager and the TaskManager are started on the master and worker respectively.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Command to start the cluster: bin/start-cluster.sh
>>>>>>>>>>>>>
>>>>>>>>>>>>> jps shows all the processes running.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I run the following command to run a WordCount example job: ./bin/flink run ./examples/WordCount.jar
>>>>>>>>>>>>>
>>>>>>>>>>>>> The result is attached with the mail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The error is org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job ....................... Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Therefore I suppose that the JobManager does not find the TaskManager. I checked the logs of the TaskManager, which indeed show that the TaskManager is unable to register at the JobManager for quite a long time. There are org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from localhost: Connect timed out and org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from address localhost: Network is Unreachable messages in the log of the TaskManager. Later, when it starts up after a number of attempts, it tries to register at the JobManager, which also fails after a lot of attempts, showing the following messages: org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated for 5000ms, all messages to this address will be delivered to dead letters. Reason: Connection timed out: /master:6123
>>>>>>>>>>>>>
>>>>>>>>>>>>> I searched the internet for these messages and found the links http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb and https://issues.apache.org/jira/browse/FLINK-1119 helpful. Stephan Ewen, who provided the solution in both links, explains that the TaskManagers take quite some time to register at the JobManager, so I waited for as long as 20 minutes after starting the cluster before running the job. But even after waiting that long I get the same error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Another suggestion was to run the cluster in streaming mode. So I tried it with the command bin/start-cluster-streaming.sh and ran the job, but I get the same error.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I re-checked all the configurations but could not find anything wrong. I also checked port 6123 on the master, which is in LISTEN state, and the TCP request from worker to master shows SYN_SENT state (using the netstat -na and lsof -i commands).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I opened the webpage on the master, http://localhost:8081, but it shows nothing, and localhost:8080 says connection refused.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kindly help me out as it is very important for me. Let me know if you have any questions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>>> Ravinder Kaur
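The thread narrows the problem to the worker not reaching the JobManager's RPC port: the worker's connection to port 6123 stays in SYN_SENT and the TaskManager log reports "Could not connect to /master-IP:6123". A minimal sketch of that reachability check, meant to be run from the worker, could look like the following. It is not part of Flink; `can_connect` is a hypothetical helper name, and the host name in the comment is only the placeholder used earlier in the thread for jobmanager.rpc.address.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper for illustration only; this is not a Flink API.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers unresolvable host names, refused connections and timeouts.
        return False


# Example call with placeholder values (assumption: the real host name is
# whatever jobmanager.rpc.address is set to in flink-conf.yaml):
#   can_connect("master.abc.xyz.com", 6123)
```

If this returns False on the worker while port 6123 is in LISTEN state on the master, that would be consistent with a name-resolution or firewall problem between the two VMs rather than with Flink's slot configuration.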