Flink on YARN: Stuck on "Trying to register at JobManager"

Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
Hi Guys!

I'm attempting to run Flink on YARN, but I've run into an issue. I'm starting the Flink YARN session from an Ubuntu 14.04 VM. All goes well until after the JobManager web UI is started:

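JobManager web interface address http://head05.hathi.surfsara.nl:8088/proxy/application_1452780322684_10532/
Waiting until all TaskManagers have connected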
11:09:51,557 INFO  org.apache.flink.yarn.ApplicationClient                       - Notification about new leader address akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
No status updates from the YARN cluster received so far. Waiting ...
11:09:51,578 INFO  org.apache.flink.yarn.ApplicationClient                       - Received address of new leader akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
11:09:51,583 INFO  org.apache.flink.yarn.ApplicationClient                       - Disconnect from JobManager null.
11:09:51,595 INFO  org.apache.flink.yarn.ApplicationClient                       - Trying to register at JobManager akka.tcp://flink@145.100.41.148:35666/user/jobmanager.
No status updates from the YARN cluster received so far. Waiting ...
No status updates from the YARN cluster received so far. Waiting ...

It then hangs on these last steps (trying to register, no status updates).

I'm sure there must be a problem on my side that prevents me from registering at the JobManager. What could cause such connection problems?
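
When debugging a hang like this, it helps to know exactly which host and port the client is trying to reach. The following is a minimal Python sketch, not part of Flink, that pulls the endpoints out of `akka.tcp://...` addresses in the client log. The function name and the regular expression are illustrative assumptions based only on the log lines shown above.

```python
import re

# Matches addresses of the form akka.tcp://<system>@<host>:<port>
# (pattern inferred from the log lines in this thread, not from Flink's code).
AKKA_ADDRESS = re.compile(
    r"akka\.tcp://(?P<system>[^@\s]+)@(?P<host>[^:\s/]+):(?P<port>\d+)"
)


def jobmanager_endpoints(log_text):
    """Return distinct (host, port) pairs in order of first appearance."""
    seen = []
    for match in AKKA_ADDRESS.finditer(log_text):
        endpoint = (match.group("host"), int(match.group("port")))
        if endpoint not in seen:
            seen.append(endpoint)
    return seen


log = (
    "11:09:51,595 INFO  org.apache.flink.yarn.ApplicationClient - "
    "Trying to register at JobManager "
    "akka.tcp://flink@145.100.41.148:35666/user/jobmanager."
)
print(jobmanager_endpoints(log))  # [('145.100.41.148', 35666)]
```

The resulting host and port are the ones that must be reachable from the machine running the YARN session client.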

Any tips are very welcome :-)

Cheers and have a good weekend!

- Pieter



Re: Flink on YARN: Stuck on "Trying to register at JobManager"

rmetzger0
Hi,

Did you check the logs of the JobManager itself? They may already tell us what's going on.


Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
Hi Robert,

Unfortunately, there are no signs of what is going wrong in the logs. The last log messages are about successful registration of the TaskManagers.

I'm also fairly sure it must be something in my VM that is causing this, because when I start the yarn-session from a login node that is on the same network as the Hadoop cluster, there are no problems registering with the JobManager. I also noticed the following message in the local console:

12:30:27,173 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@145.100.41.13:41539]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: connection timed out: /145.100.41.13:41539

I can ping the JobManager fine from within the VM. Could there be some invalid or missing configuration on my side?
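
A successful ping only shows that ICMP traffic gets through. It says nothing about whether a TCP connection to the specific port can be opened, which is what the warning above complains about. A minimal Python sketch for testing that directly follows. The helper name `can_connect` is an assumption made for illustration, and the address in the comment is simply the one from the warning.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example, run from the VM that starts the YARN session:
#   can_connect("145.100.41.13", 41539)
```

If this returns False from the VM but True from a login node inside the cluster network, a firewall or missing port forwarding between the two networks is a likely cause.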

Cheers,

Pieter



Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Maximilian Michels
Hi Pieter,

Which version of Flink are you using? It appears you've created a
Flink YARN cluster but you can't reach the JobManager afterwards.

Cheers,
Max

On Sat, Feb 6, 2016 at 1:42 PM, Pieter Hameete <[hidden email]> wrote:

> Hi Robert,
>
> Unfortunately, there are no signs of what is going wrong in the logs. The
> last log messages are about successful registration of the TaskManagers.
>
> I'm also fairly sure it must be something in my VM that is causing this,
> because when I start the yarn-session from a login node that is on the same
> network as the Hadoop cluster, there are no problems registering with the
> JobManager. I also noticed the following message in the local console:
>
> 12:30:27,173 WARN  Remoting
> - Tried to associate with unreachable remote address
> [akka.tcp://flink@145.100.41.13:41539]. Address is now gated for 5000 ms,
> all messages to this address will be delivered to dead letters. Reason:
> connection timed out: /145.100.41.13:41539
>
> I can ping the JobManager fine from within the VM. Could there be some
> invalid or missing configuration on my side?
>
> Cheers,
>
> Pieter
>
>
> 2016-02-06 12:54 GMT+01:00 Robert Metzger <[hidden email]>:
>>
>> Hi,
>>
>> Did you check the logs of the JobManager itself? They may already tell us
>> what's going on.
>>
>> On Sat, Feb 6, 2016 at 12:14 PM, Pieter Hameete <[hidden email]>
>> wrote:
>>>
>>> Hi Guys!
>>>
>>> I'm attempting to run Flink on YARN, but I've run into an issue. I'm
>>> starting the Flink YARN session from an Ubuntu 14.04 VM. All goes well
>>> until after the JobManager web UI is started:
>>>
>>> JobManager web interface address
>>> http://head05.hathi.surfsara.nl:8088/proxy/application_1452780322684_10532/
>>> Waiting until all TaskManagers have connected
>>> 11:09:51,557 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Notification about new leader address
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>> 11:09:51,578 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Received address of new leader
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>> 11:09:51,583 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Disconnect from JobManager null.
>>> 11:09:51,595 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Trying to register at JobManager
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>> No status updates from the YARN cluster received so far. Waiting ...
>>>
>>> It then hangs on these last steps (trying to register, no status
>>> updates).
>>>
>>> I'm sure there must be a problem on my side that prevents me from
>>> registering at the JobManager. What could cause such connection
>>> problems?
>>>
>>> Any tips are very welcome :-)
>>>
>>> Cheers and have a good weekend!
>>>
>>> - Pieter
>>>
>>>
>>
>

Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
Hi Max!

I'm using Flink 0.10.1, and indeed the cluster seems to be created fine; everything in the JobManager web UI looks good.

It seems like the JobManager initiates the connection with my VM and cannot reach it. It could be that this is similar to the problem here:


I probably have to make some changes to the networking configuration of my VM so it can be reached by the JobManager despite using a different port each time.

- Pieter

2016-02-06 14:05 GMT+01:00 Maximilian Michels <[hidden email]>:
Hi Pieter,

Which version of Flink are you using? It appears you've created a
Flink YARN cluster but you can't reach the JobManager afterwards.

Cheers,
Max


Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Stephan Ewen
Yeah, sounds a lot like the client cannot connect to the JobManager port.

The ports to communicate with HDFS and the YARN resource manager may be whitelisted or forwarded, so you can submit the YARN session but then cannot connect to the JobManager afterwards.



On Sat, Feb 6, 2016 at 2:11 PM, Pieter Hameete <[hidden email]> wrote:
Hi Max!

I'm using Flink 0.10.1, and indeed the cluster seems to be created fine; everything in the JobManager web UI looks good.

It seems like the JobManager initiates the connection with my VM and cannot reach it. It could be that this is similar to the problem here:


I probably have to make some changes to the networking configuration of my VM so it can be reached by the JobManager despite using a different port each time.

- Pieter


Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
Hi Stephan,

It certainly seems that way! I can't be the first with this issue, though? I'll have to contact the cluster admins to find a solution together. What would be a way to make the JobManager accessible from outside the network, given that the IP and port number change every time?

Alternatively, I can ask for SSH access to a node within the network. That will surely work, but it's not my preferred solution.

- Pieter

2016-02-06 16:22 GMT+01:00 Stephan Ewen <[hidden email]>:
Yeah, sounds a lot like the client cannot connect to the JobManager port.

The ports to communicate with HDFS and the YARN resource manager may be whitelisted or forwarded, so you can submit the YARN session but then cannot connect to the JobManager afterwards.




Re: Flink on YARN: Stuck on "Trying to register at JobManager"

rmetzger0
Hi,

We have had other users with a similar issue. There is a configuration value that lets you specify a single port or a range of ports for the JobManager to allocate when running on YARN.
Note that when using this with a single port, the JobManagers (JMs) may collide.
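
To illustrate why a single port can collide while a range leaves room, here is a small Python sketch. It is not Flink's allocation logic. The spec syntax (`"50100"` or `"50100-50200"`), the port numbers, and both function names are assumptions made for the example.

```python
import socket


def parse_port_spec(spec):
    """Expand a spec such as "50100" or "50100-50102" into a list of ports."""
    spec = spec.strip()
    if "-" in spec:
        low, high = (int(part) for part in spec.split("-", 1))
    else:
        low = high = int(spec)
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port spec: {spec!r}")
    return list(range(low, high + 1))


def first_free_port(spec, host="127.0.0.1"):
    """Return the first port in the spec that can be bound on host, or None."""
    for port in parse_port_spec(spec):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as candidate:
            try:
                candidate.bind((host, port))
            except OSError:
                continue  # already taken, e.g. by another process on this host
            return port
    return None
```

With a single-port spec, a second process on the same host gets `None` as soon as the first one holds the port. With a range, it can fall through to the next free port, which is why a range is the safer choice when several sessions may land on one machine.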



On Sun, Feb 7, 2016 at 7:25 PM, Pieter Hameete <[hidden email]> wrote:
Hi Stephan,

It certainly seems that way! I can't be the first with this issue, though? I'll have to contact the cluster admins to find a solution together. What would be a way to make the JobManager accessible from outside the network, given that the IP and port number change every time?

Alternatively, I can ask for SSH access to a node within the network. That will surely work, but it's not my preferred solution.

- Pieter


Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
I found the relevant information on the website. I'll consult with the cluster admin tomorrow. Thanks for the help :-)

- Pieter

2016-02-07 19:31 GMT+01:00 Robert Metzger <[hidden email]>:
Hi,

We have had other users with a similar issue. There is a configuration value that lets you specify a single port or a range of ports for the JobManager to allocate when running on YARN.
Note that when using this with a single port, the JobManagers (JMs) may collide.




Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
I've tried setting the yarn.application-master.port property in flink-conf.yaml to a range, as suggested in https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html#running-flink-on-yarn-behind-firewalls

The JobManager does not seem to be picking the property up. Am I setting this in the wrong place? Or is there another way to enforce this property?
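
One quick sanity check is to confirm that the key really is present in the flink-conf.yaml that the client reads. The Python sketch below is a simplified reader for flat `key: value` lines, not a full YAML parser and not Flink's own loader. The range `50100-50200` is an illustrative value, not one taken from this thread. Note also that the linked page belongs to the master documentation, so it may be worth verifying that the option exists in the Flink version in use (0.10.1 here); that is an inference from the URL, not something confirmed in the thread.

```python
def read_flink_conf(text):
    """Parse flat "key: value" lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Excerpt of a flink-conf.yaml; the port range is an illustrative value.
sample = """
# ports the YARN application master may use
yarn.application-master.port: 50100-50200
"""

print(read_flink_conf(sample).get("yarn.application-master.port"))  # 50100-50200
```

If the key is missing from the parsed result, the file being edited is probably not the one the client loads.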

Cheers,

Pieter

2016-02-07 20:04 GMT+01:00 Pieter Hameete <[hidden email]>:
I found the relevant information on the website. I'll consult with the cluster admin tomorrow. Thanks for the help :-)

- Pieter

2016-02-07 19:31 GMT+01:00 Robert Metzger <[hidden email]>:
Hi,

we had other users with a similar issue as well. There is a configuration value which allows you to specify a single port or a range of ports for the JobManager to allocate when running on YARN.
Note that when using this with a single port, the JMs may collide.



On Sun, Feb 7, 2016 at 7:25 PM, Pieter Hameete <[hidden email]> wrote:
Hi Stephan,

Surely it seems this way! I can't be the first with this issue, though? I'll have to contact the cluster admins to find a solution together. What would be a way to make the JobManagers accessible from outside the network, given that the IP and port number change every time?

Alternatively, I can ask for SSH access to a node within the network. That will surely work, but it's not my preferred solution.

- Pieter

2016-02-06 16:22 GMT+01:00 Stephan Ewen <[hidden email]>:
Yeah, sounds a lot like the client cannot connect to the JobManager port.

The ports to communicate with HDFS and the YARN resource manager may be whitelisted or forwarded, so you can submit the YARN session, but then not connect to the JobManager afterwards.
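
One way to confirm this from the client machine is to test the JobManager's TCP port directly, since a successful ping (ICMP) says nothing about whether a specific port is reachable. Below is a minimal Python sketch. The helper name can_connect is hypothetical, and the address and port in the usage note are simply the ones from the log output in this thread; they change with every session.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, can_connect("145.100.41.148", 35666) returning False on the VM while returning True on a login node inside the cluster network would point to a firewall between the client and the JobManager port.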



On Sat, Feb 6, 2016 at 2:11 PM, Pieter Hameete <[hidden email]> wrote:
Hi Max!

I'm using Flink 0.10.1, and indeed the cluster seems to be created fine; everything in the JobManager web UI looks good.

It seems like the JobManager initiates the connection with my VM and cannot reach it. It could be that this is similar to the problem here:


I probably have to make some changes to the networking configuration of my VM so it can be reached by the JobManager despite using a different port each time.

- Pieter

2016-02-06 14:05 GMT+01:00 Maximilian Michels <[hidden email]>:
Hi Pieter,

Which version of Flink are you using? It appears you've created a
Flink YARN cluster but you can't reach the JobManager afterwards.

Cheers,
Max

On Sat, Feb 6, 2016 at 1:42 PM, Pieter Hameete <[hidden email]> wrote:
> Hi Robert,
>
> unfortunately there are no signs of what is going wrong in the logs. The
> last log messages are about successful registration of the TaskManagers.
>
> I'm also fairly sure it must be something in my VM that is causing this,
> because when I start the yarn-session from a login node that is on the same
> network as the hadoop cluster there are no problems registering with the
> JobManager. I did also notice the following message in the local console:
>
> 12:30:27,173 WARN  Remoting
> - Tried to associate with unreachable remote address
> [akka.tcp://flink@145.100.41.13:41539]. Address is now gated for 5000 ms,
> all messages to this address will be delivered to dead letters. Reason:
> connection timed out: /145.100.41.13:41539
>
> I can ping the JobManager fine from within the VM. Could there be some invalid or
> missing configuration on my side?
>
> Cheers,
>
> Pieter
>
>
> 2016-02-06 12:54 GMT+01:00 Robert Metzger <[hidden email]>:
>>
>> Hi,
>>
>> Did you check the logs of the JobManager itself? Maybe they will already
>> tell us what's going on.
>>
>> On Sat, Feb 6, 2016 at 12:14 PM, Pieter Hameete <[hidden email]>
>> wrote:
>>>
>>> Hi Guys!
>>>
>>> I'm attempting to run Flink on YARN, but I run into an issue. I'm starting
>>> the Flink YARN session from an Ubuntu 14.04 VM. All goes well until after
>>> the JobManager web UI is started:
>>>
>>> JobManager web interface address
>>> http://head05.hathi.surfsara.nl:8088/proxy/application_1452780322684_10532/
>>> Waiting until all TaskManagers have connected
>>> 11:09:51,557 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Notification about new leader address
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>> 11:09:51,578 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Received address of new leader
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>> 11:09:51,583 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Disconnect from JobManager null.
>>> 11:09:51,595 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Trying to register at JobManager
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>> No status updates from the YARN cluster received so far. Waiting ...
>>>
>>> It then hangs on these last steps (trying to register, no status
>>> updates..)
>>>
>>> I'm sure there must be a problem on my side that is causing me not to be
>>> able to register at the JobManager. What could cause such connection
>>> problems?
>>>
>>> Any tips are very welcome :-)
>>>
>>> Cheers and have a good weekend!
>>>
>>> - Pieter
>>>
>>>
>>
>







Re: Flink on YARN: Stuck on "Trying to register at JobManager"

rmetzger0
You said earlier that you are using Flink 0.10. The feature is only available in 1.0-SNAPSHOT.

On Mon, Feb 8, 2016 at 4:53 PM, Pieter Hameete <[hidden email]> wrote:
I've tried setting the yarn.application-master.port property in flink-conf.yaml to a port range, as suggested in https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html#running-flink-on-yarn-behind-firewalls

The JobManager does not seem to be picking the property up. Am I setting this in the wrong place? Or is there another way to enforce this property?

Cheers,

Pieter









Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
Matter of RTFM eh ;-) thx and sorry for the bother.

2016-02-08 17:06 GMT+01:00 Robert Metzger <[hidden email]>:
You said earlier that you are using Flink 0.10. The feature is only available in 1.0-SNAPSHOT.










Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
After downloading and building 1.0-SNAPSHOT from the master branch, I run into another problem when starting a YARN cluster. The startup now loops indefinitely at the following step:

17:39:12,369 INFO  org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider  - Failing over to rm2
17:39:34,855 INFO  org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider  - Failing over to rm1

Any clue what could've gone wrong? I used all the defaults when building with Maven.

- Pieter



2016-02-08 17:07 GMT+01:00 Pieter Hameete <[hidden email]>:
Matter of RTFM eh ;-) thx and sorry for the bother.











Re: Flink on YARN: Stuck on "Trying to register at JobManager"

rmetzger0
Hmm, that's weird. Maybe both resource managers are marked as "standby"? I'm not sure what can cause this issue.

Which YARN version are you using? Maybe you need to build Flink against that specific Hadoop version yourself.

On Mon, Feb 8, 2016 at 5:50 PM, Pieter Hameete <[hidden email]> wrote:
After downloading and building 1.0-SNAPSHOT from the master branch, I run into another problem when starting a YARN cluster. The startup now loops indefinitely at the following step:

17:39:12,369 INFO  org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider  - Failing over to rm2
17:39:34,855 INFO  org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider  - Failing over to rm1

Any clue what could've gone wrong? I used all the defaults when building with Maven.

- Pieter














Re: Flink on YARN: Stuck on "Trying to register at JobManager"

Pieter Hameete
Solved: indeed it needed to be built for YARN 2.7.1 specifically. Cheers!

2016-02-08 19:13 GMT+01:00 Robert Metzger <[hidden email]>:
Hmm, that's weird. Maybe both resource managers are marked as "standby"? I'm not sure what can cause this issue.

Which YARN version are you using? Maybe you need to build Flink against that specific Hadoop version yourself.

On Sat, Feb 6, 2016 at 1:42 PM, Pieter Hameete <[hidden email]> wrote:
> Hi Robert,
>
> unfortunately there are no signs of what is going wrong in the logs. The
> last log messages are about successful registration of the TaskManagers.
>
> I'm also fairly sure it must be something in my VM that is causing this,
> because when I start the yarn-session from a login node that is on the same
> network as the hadoop cluster there are no problems registering with the
> JobManager. I did also notice the following message in the local console:
>
> 12:30:27,173 WARN  Remoting
> - Tried to associate with unreachable remote address
> [akka.tcp://flink@145.100.41.13:41539]. Address is now gated for 5000 ms,
> all messages to this address will be delivered to dead letters. Reason:
> connection timed out: /145.100.41.13:41539
>
> I can ping the JobManager fine from within the VM. Could there be some invalid or
> missing configuration on my side?
>
> Cheers,
>
> Pieter
>
>
> 2016-02-06 12:54 GMT+01:00 Robert Metzger <[hidden email]>:
>>
>> Hi,
>>
>> did you check the logs of the JobManager itself? Maybe it'll tell us
>> already what's going on.
>>
>> On Sat, Feb 6, 2016 at 12:14 PM, Pieter Hameete <[hidden email]>
>> wrote:
>>>
>>> Hi Guys!
>>>
>>> I'm attempting to run Flink on YARN, but I run into an issue. I'm starting
>>> the Flink YARN session from an Ubuntu 14.04 VM. All goes well until after
>>> the JobManager web UI is started:
>>>
>>> JobManager web interface address
>>> http://head05.hathi.surfsara.nl:8088/proxy/application_1452780322684_10532/
>>> Waiting until all TaskManagers have connected
>>> 11:09:51,557 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Notification about new leader address
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>> 11:09:51,578 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Received address of new leader
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager with session ID null.
>>> 11:09:51,583 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Disconnect from JobManager null.
>>> 11:09:51,595 INFO  org.apache.flink.yarn.ApplicationClient
>>> - Trying to register at JobManager
>>> akka.tcp://flink@145.100.41.148:35666/user/jobmanager.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>> No status updates from the YARN cluster received so far. Waiting ...
>>>
>>> It then hangs on these last steps (trying to register, no status
>>> updates..)
>>>
>>> I'm sure there must be a problem on my side that is causing me not to be
>>> able to register at the JobManager. What could cause such connection
>>> problems?
>>>
>>> Any tips are very welcome :-)
>>>
>>> Cheers and have a good weekend!
>>>
>>> - Pieter
>>>
>>>
>>
>