Could not resolve ResourceManager address on Flink 1.7.1


flint-stone
Hello:

I am trying to set up a standalone Flink cluster (1.7.1) and I'm getting an error very similar to the one the user reported in 
this thread. However, I believe the root cause is different: I tried starting the job manager with both start-cluster.sh and jobmanager.sh, and both failed with the same error. 
The error I get on the task manager (flink-worker1) looks like this:

6:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager..
2019-03-12 07:39:42,884 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager..
2019-03-12 07:39:52,901 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager..
2019-03-12 07:40:02,925 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager..
2019-03-12 07:40:12,939 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager..
2019-03-12 07:40:22,963 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager..
2019-03-12 07:40:32,978 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@10.0.0.6:6123/user/resourcemanager..


But the job manager seems to start up ok:

2019-03-12 07:38:36,643 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@10.0.0.6:6123]
2019-03-12 07:38:36,659 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils         - Actor system started at akka.tcp://flink@10.0.0.6:6123
2019-03-12 07:38:36,690 INFO  org.apache.flink.runtime.blob.BlobServer                      - Created BLOB server storage directory C:\cygwin64\tmp\blobStore-85b28100-fa08-4488-9f79-d0d712f34733
2019-03-12 07:38:36,690 INFO  org.apache.flink.runtime.blob.BlobServer                      - Started BLOB server at 0.0.0.0:54072 - max concurrent requests: 50 - max backlog: 1000
2019-03-12 07:38:36,705 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2019-03-12 07:38:36,721 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Trying to start actor system at 10.0.0.6:0
2019-03-12 07:38:36,737 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2019-03-12 07:38:36,752 INFO  akka.remote.Remoting                                          - Starting remoting
2019-03-12 07:38:36,768 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink-metrics@10.0.0.6:54085]
2019-03-12 07:38:36,768 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Actor system started at akka.tcp://flink-metrics@10.0.0.6:54085
2019-03-12 07:38:36,784 INFO  org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore  - Initializing FileArchivedExecutionGraphStore: Storage directory C:\cygwin64\tmp\executionGraphStore-550bff8d-314e-4a04-b10e-93bdc7af80c6, expiration time 3600000, maximum cache size 52428800 bytes.
2019-03-12 07:38:36,815 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Created BLOB cache storage directory C:\cygwin64\tmp\blobStore-608a5134-9f0d-44dd-8e3d-d9fbe4185d21
2019-03-12 07:38:36,830 WARN  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Upload directory C:\cygwin64\tmp\flink-web-2d9712e2-54cb-428a-a27a-826fa2214dad\flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2019-03-12 07:38:36,830 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Created directory C:\cygwin64\tmp\flink-web-2d9712e2-54cb-428a-a27a-826fa2214dad\flink-web-upload for file uploads.
2019-03-12 07:38:36,830 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Starting rest endpoint.
2019-03-12 07:38:37,065 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Log file environment variable 'log.file' is not set.
2019-03-12 07:38:37,065 WARN  org.apache.flink.runtime.webmonitor.WebMonitorUtils           - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'Key: 'web.log.path' , default: null (deprecated keys: [jobmanager.web.log.path])'.
2019-03-12 07:38:38,034 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest endpoint listening at 10.0.0.6:8081
2019-03-12 07:38:38,034 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - http://10.0.0.6:8081 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
2019-03-12 07:38:38,034 INFO  org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web frontend listening at http://10.0.0.6:8081.
2019-03-12 07:38:38,096 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
2019-03-12 07:38:38,112 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2019-03-12 07:38:38,190 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@10.0.0.6:6123/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
2019-03-12 07:38:38,190 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
2019-03-12 07:38:38,206 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@10.0.0.6:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
2019-03-12 07:38:38,221 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2019-03-12 07:44:20,564 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [/10.0.0.7:51057] failed with java.io.IOException: An existing connection was forcibly closed by the remote host
2019-03-12 07:44:20,564 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-worker1:50978] has failed, address is now gated for [50] ms. Reason: [Disassociated]


Interestingly, the worker node (flink-worker1) never seems to connect to the job manager; it just keeps retrying. But when I force the task manager to shut down, the job manager logs a warning at the end saying the association has failed. For some reason, none of the task managers manage to connect, even though port 6123 on the job manager is open and listening.
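
To rule out a plain network or firewall problem, a check along these lines could be run from flink-worker1 to see whether port 6123 is reachable from the worker and not just listening locally on the job manager. This is only a minimal sketch: the helper name can_connect and the 3-second timeout are illustrative assumptions, and a successful TCP connect would not by itself prove that the Akka RPC endpoint can be resolved.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


# Example call from the worker, using the address from the logs above:
#   can_connect("10.0.0.6", 6123)
```

If this returns False from the worker while the port is listening on the job manager itself, the problem would more likely be a firewall or routing issue than the Flink configuration.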

Any suggestions would be appreciated.

Thanks!

Le