StandaloneResourceManager failed to associate with JobManager leader

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

StandaloneResourceManager failed to associate with JobManager leader

Hao Sun
Hi,

I am trying to run a cluster of job-manager and task-manager in docker.
One of each for now. I got a StandaloneResourceManager error, stating that it can not associate with job-manager. I do not know what was wrong.

I am sure that job-manager can be connected.
===============================
root@flink-jobmanager:/opt/flink# telnet flink_jobmanager 32929
Trying 172.18.0.3...
Connected to flink-jobmanager.
Escape character is '^]'.
Connection closed by foreign host.
===============================

Here is my config:
===============================
Starting Job Manager
config file:
jobmanager.rpc.address: flink_jobmanager
jobmanager.rpc.port: 6123
jobmanager.web.port: 8081
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.archive.fs.dir: file:///flink_data/completed-jobs/
historyserver.archive.fs.dir: file:///flink_data/completed-jobs/
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///flink_data/checkpoints
taskmanager.tmp.dirs: /flink_data/tmp
blob.storage.directory: /flink_data/tmp
jobmanager.web.tmpdir: /flink_data/tmp
env.log.dir: /flink_data/logs
high-availability: zookeeper
high-availability.storageDir: file:///flink_data/ha/
high-availability.zookeeper.quorum: kafka:2181
blob.server.port: 6124
query.server.port: 6125
===============================

Here is the major error I see:
===============================
2017-08-16 02:46:23,586 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-08-16 02:46:23,612 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@flink_jobmanager:32929/user/jobmanager was granted leadership with leader session ID Some(06abc8f5-c1b9-44b2-bb7f-771c74981552).
2017-08-16 02:46:23,627 INFO org.apache.flink.runtime.jobmanager.JobManager - Delaying recovery of all jobs by 10000 milliseconds.
2017-08-16 02:46:23,638 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@flink_jobmanager:32929/user/jobmanager:06abc8f5-c1b9-44b2-bb7f-771c74981552.
2017-08-16 02:46:23,640 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@flink_jobmanager:32929/user/jobmanager
2017-08-16 02:46:23,653 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka://flink/deadLetters), Path(/)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.scala$concurrent$impl$Promise$DefaultPromise$$dispatchOrAddCallback(Promise.scala:280)
at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:270)
at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:63)
at org.apache.flink.runtime.akka.AkkaUtils$.getActorRefFuture(AkkaUtils.scala:498)
at org.apache.flink.runtime.akka.AkkaUtils.getActorRefFuture(AkkaUtils.scala)
at org.apache.flink.runtime.webmonitor.JobManagerRetriever.notifyLeaderAddress(JobManagerRetriever.java:141)
at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.nodeChanged(ZooKeeperLeaderRetrievalService.java:168)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:310)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:304)
at org.apache.flink.shaded.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
at org.apache.flink.shaded.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
at org.apache.flink.shaded.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.setNewData(NodeCache.java:302)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.processBackgroundResult(NodeCache.java:269)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.access$300(NodeCache.java:56)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$3.processResult(NodeCache.java:122)
at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749)
at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522)
at org.apache.flink.shaded.org.apache.curator.framework.imps.GetDataBuilderImpl$3.processResult(GetDataBuilderImpl.java:257)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:561)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2017-08-16 02:46:33,644 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover all jobs.
2017-08-16 02:46:33,648 INFO org.apache.flink.runtime.jobmanager.JobManager - There are no jobs to recover.
===============================

More detailed log:
https://gist.github.com/zenhao/19926402438f613c331ffe5b6e6e005d
Reply | Threaded
Open this post in threaded view
|

Re: StandaloneResourceManager failed to associate with JobManager leader

Till Rohrmann
Hi Hao Sun,

have you checked that one can resolve the hostname flink_jobmanager from within the container? This is required to connect to the JobManager. If this is the case, then log files with DEBUG log level would be helpful to track down the problem.

Cheers,
Till

On Wed, Aug 16, 2017 at 5:35 AM, Hao Sun <[hidden email]> wrote:
Hi,

I am trying to run a cluster of job-manager and task-manager in docker.
One of each for now. I got a StandaloneResourceManager error, stating that it can not associate with job-manager. I do not know what was wrong.

I am sure that job-manager can be connected.
===============================
root@flink-jobmanager:/opt/flink# telnet flink_jobmanager 32929
Trying 172.18.0.3...
Connected to flink-jobmanager.
Escape character is '^]'.
Connection closed by foreign host.
===============================

Here is my config:
===============================
Starting Job Manager
config file:
jobmanager.rpc.address: flink_jobmanager
jobmanager.rpc.port: 6123
jobmanager.web.port: 8081
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.archive.fs.dir: file:///flink_data/completed-jobs/
historyserver.archive.fs.dir: file:///flink_data/completed-jobs/
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///flink_data/checkpoints
taskmanager.tmp.dirs: /flink_data/tmp
blob.storage.directory: /flink_data/tmp
jobmanager.web.tmpdir: /flink_data/tmp
env.log.dir: /flink_data/logs
high-availability: zookeeper
high-availability.storageDir: file:///flink_data/ha/
high-availability.zookeeper.quorum: kafka:2181
blob.server.port: 6124
query.server.port: 6125
===============================

Here is the major error I see:
===============================
2017-08-16 02:46:23,586 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-08-16 02:46:23,612 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@flink_jobmanager:32929/user/jobmanager was granted leadership with leader session ID Some(06abc8f5-c1b9-44b2-bb7f-771c74981552).
2017-08-16 02:46:23,627 INFO org.apache.flink.runtime.jobmanager.JobManager - Delaying recovery of all jobs by 10000 milliseconds.
2017-08-16 02:46:23,638 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@flink_jobmanager:32929/user/jobmanager:06abc8f5-c1b9-44b2-bb7f-771c74981552.
2017-08-16 02:46:23,640 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@flink_jobmanager:32929/user/jobmanager
2017-08-16 02:46:23,653 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka://flink/deadLetters), Path(/)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.scala$concurrent$impl$Promise$DefaultPromise$$dispatchOrAddCallback(Promise.scala:280)
at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:270)
at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:63)
at org.apache.flink.runtime.akka.AkkaUtils$.getActorRefFuture(AkkaUtils.scala:498)
at org.apache.flink.runtime.akka.AkkaUtils.getActorRefFuture(AkkaUtils.scala)
at org.apache.flink.runtime.webmonitor.JobManagerRetriever.notifyLeaderAddress(JobManagerRetriever.java:141)
at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.nodeChanged(ZooKeeperLeaderRetrievalService.java:168)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:310)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:304)
at org.apache.flink.shaded.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
at org.apache.flink.shaded.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
at org.apache.flink.shaded.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.setNewData(NodeCache.java:302)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.processBackgroundResult(NodeCache.java:269)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.access$300(NodeCache.java:56)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$3.processResult(NodeCache.java:122)
at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749)
at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522)
at org.apache.flink.shaded.org.apache.curator.framework.imps.GetDataBuilderImpl$3.processResult(GetDataBuilderImpl.java:257)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:561)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2017-08-16 02:46:33,644 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover all jobs.
2017-08-16 02:46:33,648 INFO org.apache.flink.runtime.jobmanager.JobManager - There are no jobs to recover.
===============================

More detailed log:
https://gist.github.com/zenhao/19926402438f613c331ffe5b6e6e005d

Reply | Threaded
Open this post in threaded view
|

Re: StandaloneResourceManager failed to associate with JobManager leader

Hao Sun
Thanks Till, the DEBUG log level is a good idea. I figured it out. I made a mistake with `-` and `_`. 

On Tue, Aug 22, 2017 at 1:39 AM Till Rohrmann <[hidden email]> wrote:
Hi Hao Sun,

have you checked that one can resolve the hostname flink_jobmanager from within the container? This is required to connect to the JobManager. If this is the case, then log files with DEBUG log level would be helpful to track down the problem.

Cheers,
Till

On Wed, Aug 16, 2017 at 5:35 AM, Hao Sun <[hidden email]> wrote:
Hi,

I am trying to run a cluster of job-manager and task-manager in docker.
One of each for now. I got a StandaloneResourceManager error, stating that it can not associate with job-manager. I do not know what was wrong.

I am sure that job-manager can be connected.
===============================
root@flink-jobmanager:/opt/flink# telnet flink_jobmanager 32929
Trying 172.18.0.3...
Connected to flink-jobmanager.
Escape character is '^]'.
Connection closed by foreign host.
===============================

Here is my config:
===============================
Starting Job Manager
config file:
jobmanager.rpc.address: flink_jobmanager
jobmanager.rpc.port: 6123
jobmanager.web.port: 8081
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.archive.fs.dir: file:///flink_data/completed-jobs/
historyserver.archive.fs.dir: file:///flink_data/completed-jobs/
state.backend: rocksdb
state.backend.fs.checkpointdir: file:///flink_data/checkpoints
taskmanager.tmp.dirs: /flink_data/tmp
blob.storage.directory: /flink_data/tmp
jobmanager.web.tmpdir: /flink_data/tmp
env.log.dir: /flink_data/logs
high-availability: zookeeper
high-availability.storageDir: file:///flink_data/ha/
high-availability.zookeeper.quorum: kafka:2181
blob.server.port: 6124
query.server.port: 6125
===============================

Here is the major error I see:
===============================
2017-08-16 02:46:23,586 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-08-16 02:46:23,612 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@flink_jobmanager:32929/user/jobmanager was granted leadership with leader session ID Some(06abc8f5-c1b9-44b2-bb7f-771c74981552).
2017-08-16 02:46:23,627 INFO org.apache.flink.runtime.jobmanager.JobManager - Delaying recovery of all jobs by 10000 milliseconds.
2017-08-16 02:46:23,638 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@flink_jobmanager:32929/user/jobmanager:06abc8f5-c1b9-44b2-bb7f-771c74981552.
2017-08-16 02:46:23,640 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink@flink_jobmanager:32929/user/jobmanager
2017-08-16 02:46:23,653 WARN org.apache.flink.runtime.webmonitor.JobManagerRetriever - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka://flink/deadLetters), Path(/)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.scala$concurrent$impl$Promise$DefaultPromise$$dispatchOrAddCallback(Promise.scala:280)
at scala.concurrent.impl.Promise$DefaultPromise.onComplete(Promise.scala:270)
at akka.actor.ActorSelection.resolveOne(ActorSelection.scala:63)
at org.apache.flink.runtime.akka.AkkaUtils$.getActorRefFuture(AkkaUtils.scala:498)
at org.apache.flink.runtime.akka.AkkaUtils.getActorRefFuture(AkkaUtils.scala)
at org.apache.flink.runtime.webmonitor.JobManagerRetriever.notifyLeaderAddress(JobManagerRetriever.java:141)
at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.nodeChanged(ZooKeeperLeaderRetrievalService.java:168)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:310)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$4.apply(NodeCache.java:304)
at org.apache.flink.shaded.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
at org.apache.flink.shaded.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
at org.apache.flink.shaded.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.setNewData(NodeCache.java:302)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.processBackgroundResult(NodeCache.java:269)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.access$300(NodeCache.java:56)
at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache$3.processResult(NodeCache.java:122)
at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:749)
at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:522)
at org.apache.flink.shaded.org.apache.curator.framework.imps.GetDataBuilderImpl$3.processResult(GetDataBuilderImpl.java:257)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:561)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2017-08-16 02:46:33,644 INFO org.apache.flink.runtime.jobmanager.JobManager - Attempting to recover all jobs.
2017-08-16 02:46:33,648 INFO org.apache.flink.runtime.jobmanager.JobManager - There are no jobs to recover.
===============================

More detailed log:
https://gist.github.com/zenhao/19926402438f613c331ffe5b6e6e005d