Hi Guys!
I'm attempting to run Flink on YARN, but I'm running into an issue. I'm starting the Flink YARN session from an Ubuntu 14.04 VM. All goes well until after the JobManager web UI is started:

JobManager web interface address http://head05.hathi.surfsara.nl:8088/proxy/application_1452780322684_10532/
Waiting until all TaskManagers have connected
11:09:51,557 INFO org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
No status updates from the YARN cluster received so far. Waiting ...
11:09:51,578 INFO org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
11:09:51,583 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
11:09:51,595 INFO org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@145.100.41.148:35666/user/jobmanager.
No status updates from the YARN cluster received so far. Waiting ...
No status updates from the YARN cluster received so far. Waiting ...

It then hangs on these last steps (trying to register, no status updates...).

I'm sure there must be a problem on my side that prevents me from registering at the JobManager. What could cause such connection problems?

Any tips are very welcome :-)

Cheers and have a good weekend!

- Pieter
|
Hi, did you check the logs of the JobManager itself? Maybe they already tell us what's going on.
On Sat, Feb 6, 2016 at 12:14 PM, Pieter Hameete <[hidden email]> wrote:
|
Hi Robert,

unfortunately there are no signs of what is going wrong in the logs. The last log messages are about successful registration of the TaskManagers.

I'm also fairly sure it must be something in my VM that is causing this, because when I start the yarn-session from a login node that is on the same network as the Hadoop cluster, there are no problems registering with the JobManager. I also noticed the following message in the local console:

12:30:27,173 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@145.100.41.13:41539]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: connection timed out: /145.100.41.13:41539

I can ping the JobManager fine from within the VM. Could there be some invalid or missing configuration on my side?

Cheers,
Pieter

2016-02-06 12:54 GMT+01:00 Robert Metzger <[hidden email]>:
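A successful ping only shows that ICMP packets get through; it says nothing about whether a TCP connection to the JobManager's port is allowed. A minimal sketch for checking that from the VM follows. The helper name `tcp_port_reachable` is hypothetical, and the host and port would have to be taken from the current session's log output, since they change on every start.

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; it only tests plain TCP connectivity
    and knows nothing about Akka or Flink.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `tcp_port_reachable("145.100.41.13", 41539)` from the VM while the session is hanging would be one way to tell a firewalled port apart from a general network problem: if it returns False while ping works, a blocked port is the likely suspect.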
|
Hi Pieter,
Which version of Flink are you using? It appears you've created a Flink YARN cluster but you can't reach the JobManager afterwards.

Cheers,
Max

On Sat, Feb 6, 2016 at 1:42 PM, Pieter Hameete <[hidden email]> wrote:
|
Hi Max!

I'm using Flink 0.10.1, and indeed the cluster seems to be created fine; everything in the JobManager web UI looks good.

It seems like the JobManager initiates the connection with my VM and cannot reach it. It could be that this is similar to the problem here:

I probably have to make some changes to the networking configuration of my VM so that it can be reached by the JobManager despite a different port being used each time.

- Pieter

2016-02-06 14:05 GMT+01:00 Maximilian Michels <[hidden email]>:
|
Yeah, sounds a lot like the client cannot connect to the JobManager port. The ports used to communicate with HDFS and the YARN resource manager may be whitelisted or forwarded, so you can submit the YARN session but then cannot connect to the JobManager afterwards.
On Sat, Feb 6, 2016 at 2:11 PM, Pieter Hameete <[hidden email]> wrote:
|
Hi Stephan, it certainly seems that way! I can't be the first with this issue, though? I'll have to contact the cluster admins to find a solution together. What would be a way of making the JobManager accessible from outside the network, given that the IP and port number change every time?
2016-02-06 16:22 GMT+01:00 Stephan Ewen <[hidden email]>:
|
Hi, we had other users with a similar issue as well. There is a configuration value which allows you to specify a single port or a range of ports for the JobManager to allocate when running on YARN. Note that when using this with a single port, the JMs may collide. On Sun, Feb 7, 2016 at 7:25 PM, Pieter Hameete <[hidden email]> wrote:
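To illustrate why a single fixed port can collide while a range does not, here is a small sketch that is not Flink code. It assumes a port spec of the form `50100`, `50100-50200`, or a comma-separated mix of both; the function names `parse_port_spec` and `bind_first_free` are hypothetical. A second process asking for the same single port on the same host finds it taken, whereas with a range it simply moves on to the next free port.

```python
import socket
from typing import Iterator, Optional


def parse_port_spec(spec: str) -> Iterator[int]:
    """Yield ports from a spec such as "50100", "50100-50200" or "50100,50200-50210".

    The spec syntax is an assumption made for this illustration.
    """
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(p) for p in part.split("-", 1))
            yield from range(low, high + 1)
        else:
            yield int(part)


def bind_first_free(spec: str, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Bind and listen on the first free port in the spec; return None if all are taken."""
    for port in parse_port_spec(spec):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(1)
            return sock
        except OSError:
            # Port already in use (or not permitted): try the next candidate.
            sock.close()
    return None
```

With a single-port spec, a second call on the same host returns None as long as the first socket is still open. That is the collision mentioned above, and it is the reason a range sized for the number of concurrent sessions is usually the safer choice to ask the cluster admins to open.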
|
I found the relevant information on the website. I'll consult with the cluster admin tomorrow. Thanks for the help :-)
- Pieter
2016-02-07 19:31 GMT+01:00 Robert Metzger <[hidden email]>:
|
I've tried setting the yarn.application-master.port property in flink-conf.yaml to a range, as suggested in https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html#running-flink-on-yarn-behind-firewalls. The JobManager does not seem to be picking the property up. Am I setting this in the wrong place? Or is there another way to enforce this property?
2016-02-07 20:04 GMT+01:00 Pieter Hameete <[hidden email]>:
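One quick sanity check is to confirm that the key is actually present and spelled correctly in the file that gets read. The sketch below assumes the file consists of flat `key: value` lines with `#` comments; the helper `read_flat_conf` and the sample excerpt are hypothetical. It can only show what is in the file, not whether the Flink version in use honors the key.

```python
from typing import Dict


def read_flat_conf(text: str) -> Dict[str, str]:
    """Parse flat "key: value" lines, ignoring blank lines and '#' comments.

    Simplified, hypothetical reader for illustration; not Flink's own parser.
    """
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


# Hypothetical excerpt of a flink-conf.yaml; the values are made up.
SAMPLE = """
# ports the cluster admins agreed to open
jobmanager.heap.mb: 1024
yarn.application-master.port: 50100-50200
"""

conf = read_flat_conf(SAMPLE)
print(conf.get("yarn.application-master.port", "<not set>"))
```

To check a real installation, the contents of the actual flink-conf.yaml would be passed in instead of `SAMPLE`.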
|
You said earlier that you are using Flink 0.10. The feature is only available in 1.0-SNAPSHOT. On Mon, Feb 8, 2016 at 4:53 PM, Pieter Hameete <[hidden email]> wrote:
|
Matter of RTFM eh ;-) thx and sorry for the bother. 2016-02-08 17:06 GMT+01:00 Robert Metzger <[hidden email]>:
|
After downloading and building 1.0-SNAPSHOT from the master branch, I run into another problem when starting a YARN cluster. The startup now loops indefinitely at the following step:

17:39:12,369 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm2
17:39:34,855 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider - Failing over to rm1

Any clue what could've gone wrong? I used all defaults for building with Maven.

- Pieter

2016-02-08 17:07 GMT+01:00 Pieter Hameete <[hidden email]>:
|
Mh, that's weird. Maybe both resource managers are marked as "standby"? Not sure what can cause this issue. Which YARN version are you using? Maybe you need to build Flink against that specific Hadoop version yourself.
On Mon, Feb 8, 2016 at 5:50 PM, Pieter Hameete <[hidden email]> wrote:
|
Solved: indeed it needed to be built for YARN 2.7.1 specifically. Cheers! 2016-02-08 19:13 GMT+01:00 Robert Metzger <[hidden email]>:
|