Re: Master (1.1-SNAPSHOT) Can't run on YARN

Posted by stefanobaghino on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Master-1-1-SNAPSHOT-Can-t-run-on-YARN-tp6223p6239.html

Not exactly, I just wanted to let you know about it and find out whether anyone else has experienced this issue. Perhaps it's more of a dev mailing list discussion; sorry for posting it here. Feel free to continue the discussion on the other list if you feel it's more appropriate.

On Tue, Apr 19, 2016 at 6:53 PM, Ufuk Celebi <[hidden email]> wrote:
Hey Stefano,

Flink's resource management has been refactored for 1.1 recently. This
could be a regression introduced by this. Max can probably help you
with more details. Is this currently a blocker for you?

– Ufuk

On Tue, Apr 19, 2016 at 6:31 PM, Stefano Baghino
<[hidden email]> wrote:
> Hi everyone,
>
> I'm currently experiencing a weird situation and I hope you can help me
> out with it.
>
> I've cloned and built from master, then edited the default config file to
> add my Hadoop config path, exported the HADOOP_CONF_DIR env var, and ran
> bin/yarn-session.sh -n 1 -s 2 -jm 2048 -tm 2048
>
> The first thing I noticed is that I had to pass "-s 2", or the task managers
> get created with -1 slots (!) by default.
>
> After adding "-s 2", the YARN session startup hangs when trying to register
> the task managers. I stopped the session, aggregated the logs, and found
> several thousand of the messages I attach at the bottom. Any idea what this
> may be?
>
> Thank you a lot in advance!
>
> 2016-04-19 12:15:59,507 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1, timeout:
> 500 milliseconds)
>
> 2016-04-19 12:15:59,649 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:00,025 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 2, timeout:
> 1000 milliseconds)
>
> 2016-04-19 12:16:00,033 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:01,045 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 3, timeout:
> 2000 milliseconds)
>
> 2016-04-19 12:16:01,053 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:03,064 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 4, timeout:
> 4000 milliseconds)
>
> 2016-04-19 12:16:03,072 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:07,085 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 5, timeout:
> 8000 milliseconds)
>
> 2016-04-19 12:16:07,092 ERROR org.apache.flink.yarn.YarnTaskManager
> - The registration at JobManager
> Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused,
> because: java.lang.IllegalStateException: Resource
> ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not
> registered with resource manager.. Retrying later...
>
> 2016-04-19 12:16:09,664 INFO  org.apache.flink.yarn.YarnTaskManager
> - Trying to register at JobManager
> akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1, timeout:
> 500 milliseconds)
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
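
For anyone hitting the same symptom, here is a minimal sketch (not part of the original report) for summarizing an aggregated log that contains thousands of these messages. It counts refused registrations per container ID and collects the attempt/timeout pairs. The function name summarize and both regular expressions are assumptions modeled on the messages quoted above, and the sketch assumes each log entry sits on a single line rather than wrapped as in this email.

```python
import re
from collections import Counter

# Patterns modeled on the log messages quoted in this thread (assumption:
# one log entry per line, as in an unwrapped aggregated log).
ATTEMPT = re.compile(
    r"Trying to register at JobManager \S+ "
    r"\(attempt (\d+), timeout: (\d+) milliseconds\)"
)
REFUSED = re.compile(r"was refused, because: .*?resourceId='([^']+)'")


def summarize(lines):
    """Return (refusal count per container ID, list of (attempt, timeout_ms))."""
    refusals = Counter()
    attempts = []
    for line in lines:
        match = ATTEMPT.search(line)
        if match:
            attempts.append((int(match.group(1)), int(match.group(2))))
            continue
        match = REFUSED.search(line)
        if match:
            refusals[match.group(1)] += 1
    return refusals, attempts


# Sample entries copied from the quoted log, joined onto single lines.
SAMPLE = [
    "2016-04-19 12:15:59,507 INFO  org.apache.flink.yarn.YarnTaskManager - "
    "Trying to register at JobManager "
    "akka.tcp://flink@172.31.20.101:57379/user/jobmanager "
    "(attempt 1, timeout: 500 milliseconds)",
    "2016-04-19 12:15:59,649 ERROR org.apache.flink.yarn.YarnTaskManager - "
    "The registration at JobManager "
    "Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, "
    "because: java.lang.IllegalStateException: Resource "
    "ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not "
    "registered with resource manager.. Retrying later...",
    "2016-04-19 12:16:00,025 INFO  org.apache.flink.yarn.YarnTaskManager - "
    "Trying to register at JobManager "
    "akka.tcp://flink@172.31.20.101:57379/user/jobmanager "
    "(attempt 2, timeout: 1000 milliseconds)",
    "2016-04-19 12:16:00,033 ERROR org.apache.flink.yarn.YarnTaskManager - "
    "The registration at JobManager "
    "Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, "
    "because: java.lang.IllegalStateException: Resource "
    "ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not "
    "registered with resource manager.. Retrying later...",
]

if __name__ == "__main__":
    refusals, attempts = summarize(SAMPLE)
    for container, count in refusals.items():
        print(f"{container}: {count} refused registrations")
    print("attempt/timeout pairs:", attempts)
```

summarize accepts any iterable of strings, so an open file handle over the aggregated log should work in place of SAMPLE.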



--
BR,
Stefano Baghino

Software Engineer @ Radicalbit