Hi everyone,
I'm currently experiencing a weird situation, and I hope you can help me out with this.

I've cloned and built from the master, then I've edited the default config file by adding my Hadoop config path, exported the HADOOP_CONF_DIR env var and ran

bin/yarn-session.sh -n 1 -s 2 -jm 2048 -tm 2048

The first thing I noticed is that I had to put "-s 2" or the task managers get created with -1 slots (!) by default.

After putting "-s 2" the YARN session startup hangs when trying to register the task managers. I've stopped the session, aggregated the logs and found several thousand messages like the ones I attach at the bottom; any idea of what this may be?

Thank you a lot in advance!

2016-04-19 12:15:59,507 INFO org.apache.flink.yarn.YarnTaskManager - Trying to register at JobManager akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2016-04-19 12:15:59,649 ERROR org.apache.flink.yarn.YarnTaskManager - The registration at JobManager Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, because: java.lang.IllegalStateException: Resource ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not registered with resource manager.. Retrying later...
2016-04-19 12:16:00,025 INFO org.apache.flink.yarn.YarnTaskManager - Trying to register at JobManager akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
2016-04-19 12:16:00,033 ERROR org.apache.flink.yarn.YarnTaskManager - The registration at JobManager Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, because: java.lang.IllegalStateException: Resource ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not registered with resource manager.. Retrying later...
2016-04-19 12:16:01,045 INFO org.apache.flink.yarn.YarnTaskManager - Trying to register at JobManager akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
2016-04-19 12:16:01,053 ERROR org.apache.flink.yarn.YarnTaskManager - The registration at JobManager Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, because: java.lang.IllegalStateException: Resource ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not registered with resource manager.. Retrying later...
2016-04-19 12:16:03,064 INFO org.apache.flink.yarn.YarnTaskManager - Trying to register at JobManager akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
2016-04-19 12:16:03,072 ERROR org.apache.flink.yarn.YarnTaskManager - The registration at JobManager Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, because: java.lang.IllegalStateException: Resource ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not registered with resource manager.. Retrying later...
2016-04-19 12:16:07,085 INFO org.apache.flink.yarn.YarnTaskManager - Trying to register at JobManager akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 5, timeout: 8000 milliseconds)
2016-04-19 12:16:07,092 ERROR org.apache.flink.yarn.YarnTaskManager - The registration at JobManager Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, because: java.lang.IllegalStateException: Resource ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} not registered with resource manager.. Retrying later...
2016-04-19 12:16:09,664 INFO org.apache.flink.yarn.YarnTaskManager - Trying to register at JobManager akka.tcp://flink@172.31.20.101:57379/user/jobmanager (attempt 1, timeout: 500 milliseconds)

BR,
Stefano Baghino

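When the aggregated logs contain thousands of these lines, a quick tally can make the pattern easier to see. The following is a minimal Python sketch and not part of Flink: the function name `summarize`, the `SAMPLE` list and both regular expressions are assumptions, derived only from the two message shapes quoted in this thread. It collects the (attempt, timeout) pairs in order and counts refused registrations per container ID.

```python
import re
from collections import Counter

# Assumed patterns, based solely on the two log message shapes quoted in this thread.
ATTEMPT_RE = re.compile(r"\(attempt (\d+), timeout: (\d+) milliseconds\)")
REFUSED_RE = re.compile(r"was refused, because: .*?resourceId='([^']+)'")


def summarize(lines):
    """Return (attempts, refusals).

    attempts: list of (attempt_number, timeout_ms) tuples in log order.
    refusals: Counter of refused registrations keyed by resource/container ID.
    """
    attempts = []
    refusals = Counter()
    for line in lines:
        match = ATTEMPT_RE.search(line)
        if match:
            attempts.append((int(match.group(1)), int(match.group(2))))
            continue
        match = REFUSED_RE.search(line)
        if match:
            refusals[match.group(1)] += 1
    return attempts, refusals


# Two lines copied from the log excerpt in this thread, used as sample input.
SAMPLE = [
    "2016-04-19 12:15:59,507 INFO org.apache.flink.yarn.YarnTaskManager - "
    "Trying to register at JobManager "
    "akka.tcp://flink@172.31.20.101:57379/user/jobmanager "
    "(attempt 1, timeout: 500 milliseconds)",
    "2016-04-19 12:15:59,649 ERROR org.apache.flink.yarn.YarnTaskManager - "
    "The registration at JobManager "
    "Some(akka.tcp://flink@172.31.20.101:57379/user/jobmanager) was refused, "
    "because: java.lang.IllegalStateException: Resource "
    "ResourceID{resourceId='container_e02_1461077293721_0016_01_000002'} "
    "not registered with resource manager.. Retrying later...",
]

attempts, refusals = summarize(SAMPLE)
print("attempts:", attempts)
for resource_id, count in refusals.most_common():
    print(resource_id, "refused", count, "time(s)")
```

To run it over a real log, pass an open file object to `summarize` instead of `SAMPLE`. In the excerpt above, the timeout doubles from 500 to 8000 milliseconds over attempts 1 to 5, and the last line shows the counter back at attempt 1.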
Hey Stefano,
Flink's resource management has been refactored for 1.1 recently. This could be a regression introduced by this. Max can probably help you with more details. Is this currently a blocker for you?

– Ufuk

On Tue, Apr 19, 2016 at 6:31 PM, Stefano Baghino <[hidden email]> wrote:
> After putting "-s 2" the YARN session startup hangs when trying to register
> the task managers.

Not exactly, I just wanted to let you know about it and find out whether someone else has experienced this issue; perhaps it's more of a dev mailing list discussion, sorry for posting this here. Feel free to continue the discussion on the other list if you feel it's more appropriate.

On Tue, Apr 19, 2016 at 6:53 PM, Ufuk Celebi <[hidden email]> wrote:
> Hey Stefano,

BR,
Stefano Baghino

The user list is OK since you are reporting a bug here ;-) I'm confident that this will be fixed soon! :-)

On Wed, Apr 20, 2016 at 11:28 AM, Stefano Baghino <[hidden email]> wrote:
> Feel free to continue the discussion on the
> other list if you feel it's more appropriate.

Hi Stefano,
Thanks for reporting. I wasn't able to reproduce the problem. I ran

./bin/yarn-session.sh -n 1 -s 2 -jm 2048 -tm 2048

on a YARN cluster and it created a Flink cluster with a JobManager and a TaskManager with two task slots. By the way, if you omit the "-s 2" flag, then the default is read from the config, which is one task slot.

Could it be that an old TaskManager instance is trying to register with a new JobManager? It looks like it from the log messages because the ResourceManager (which creates TaskManagers) is not aware of it. Still, it is questionable why that instance is lingering around. Could you try to kill the instance and then bring up a cluster several times to see if that solves the problem? If not, could you send the full logs to my email address?

Thanks,
Max

On Wed, Apr 20, 2016 at 4:30 PM, Ufuk Celebi <[hidden email]> wrote:
> The user list is OK since you are reporting a bug here ;-) I'm
> confident that this will be fixed soon! :-)

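On the point that the slot default is read from the config: one way to double-check what a given flink-conf.yaml would yield is to look up the `taskmanager.numberOfTaskSlots` key. The Python sketch below is a simplified assumption, not Flink's own configuration loader: the helper names `read_flink_conf` and `configured_slots` are hypothetical, it only handles flat `key: value` lines and `#` comments, and the fallback of 1 reflects the default of one task slot mentioned in this thread.

```python
def read_flink_conf(text):
    """Parse flat 'key: value' lines, skipping blanks and '#' comments.

    Simplified on purpose: a '#' inside a value would be treated as a comment.
    """
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)  # split once so values may contain ':'
        conf[key.strip()] = value.strip()
    return conf


def configured_slots(conf, default=1):
    """Return the configured number of task slots, or the assumed default of 1."""
    return int(conf.get("taskmanager.numberOfTaskSlots", default))


# Hypothetical config contents used only for illustration.
example = """
# example fragment
jobmanager.rpc.address: localhost
taskmanager.numberOfTaskSlots: 2
"""

print(configured_slots(read_flink_conf(example)))
```

If the key is absent from the file, the sketch falls back to 1, so a reported value of -1 would point at something other than the file contents.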