Hi all,
I’m experimenting with using my own implementation of HA services instead of ZooKeeper that would persist JobManager information on a Kubernetes volume instead of in ZooKeeper. I’ve set the high-availability option in flink-conf.yaml to the FQN of my factory class, and started the docker ensemble as I usually do (i.e. with no special “cluster” arguments or scripts.) What’s happening now is that TaskManager is unable to connect to ResourceManager, because it seems it’s trying to use the /user/jobmanager path instead of /user/resourcemanager. Here’s what I found in the logs:

jobmanager_1 | 2019-08-22 00:05:00,963 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@jobmanager:6123]
jobmanager_1 | 2019-08-22 00:05:00,975 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink@jobmanager:6123
jobmanager_1 | 2019-08-22 00:05:02,380 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
jobmanager_1 | 2019-08-22 00:05:03,138 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
jobmanager_1 | 2019-08-22 00:05:03,211 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@jobmanager:6123/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
jobmanager_1 | 2019-08-22 00:05:03,292 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@jobmanager:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
taskmanager_1 | 2019-08-22 00:05:03,713 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000).
taskmanager_1 | 2019-08-22 00:05:04,137 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..

Is this a known bug? I’d appreciate any help I can get.

Thanks,
Aleksandar Mastilovic
|
Hi Aleksandar,

The resource manager address is retrieved from the HA services. Could you check whether your customized HA services return the right LeaderRetrievalService, and whether that LeaderRetrievalService really retrieves the right leader's address? Or is it possible that the resource manager address stored in HA is somehow being replaced by the jobmanager address?

Thanks,
Zhu Zhu

Aleksandar Mastilovic <[hidden email]> wrote on Thu, Aug 22, 2019 at 8:16 AM:
|
Hi Aleksandar,

based on your log:

taskmanager_1 | 2019-08-22 00:05:03,713 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000).
taskmanager_1 | 2019-08-22 00:05:04,137 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..

it looks like the retrieval service for the resource manager is returning a jobmanager address. Please check the implementation carefully, or share it on the mailing list so that others can help investigate.

Best,
tison.

Zhu Zhu <[hidden email]> wrote on Thu, Aug 22, 2019 at 10:11 AM:
|
Besides, would you like to participate in our survey thread[1] on the user list about "How do you use high-availability services in Flink?" It would help Flink improve its high-availability services.

Best,
tison.

Zili Chen <[hidden email]> wrote on Thu, Aug 22, 2019 at 10:16 AM:
|
Thanks for all the help, people - you made me go through my code once again and discover that I switched argument positions for job manager and resource manager addresses :-)
The docker ensemble now starts fine, and I’m working on ironing out the bugs.
I’ll participate in the survey too!
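
For anyone who lands on this thread with the same symptom: swapping two positional String arguments compiles without complaint, which is how a jobmanager address can end up where the resource manager address belongs. The following is only a minimal sketch of one way to guard against that, not the code from this thread and not part of Flink. The class LeaderAddresses and its two wrapper types are hypothetical names invented for illustration; the only detail taken from the thread is the /user/resourcemanager path suffix shown in the log.

```java
import java.util.Objects;

// Hypothetical illustration only: none of these types are part of Flink.
final class LeaderAddresses {

    /** Path suffix of the ResourceManager endpoint, as it appears in the log in this thread. */
    static final String RESOURCE_MANAGER_SUFFIX = "/user/resourcemanager";

    /** Distinct wrapper type, so it cannot be passed where the other address is expected. */
    static final class JobManagerAddress {
        final String value;

        JobManagerAddress(String value) {
            this.value = Objects.requireNonNull(value, "value");
        }
    }

    /** Wrapper that also rejects addresses that do not end in the ResourceManager suffix. */
    static final class ResourceManagerAddress {
        final String value;

        ResourceManagerAddress(String value) {
            Objects.requireNonNull(value, "value");
            if (!value.endsWith(RESOURCE_MANAGER_SUFFIX)) {
                throw new IllegalArgumentException("Not a ResourceManager address: " + value);
            }
            this.value = value;
        }
    }

    final JobManagerAddress jobManager;
    final ResourceManagerAddress resourceManager;

    // Swapping the two arguments here is a compile-time type error, not a silent bug.
    LeaderAddresses(JobManagerAddress jobManager, ResourceManagerAddress resourceManager) {
        this.jobManager = jobManager;
        this.resourceManager = resourceManager;
    }

    public static void main(String[] args) {
        String base = "akka.tcp://flink@jobmanager:6123";

        LeaderAddresses ok = new LeaderAddresses(
                new JobManagerAddress(base + "/user/jobmanager"),
                new ResourceManagerAddress(base + "/user/resourcemanager"));
        System.out.println(ok.resourceManager.value);

        try {
            // A jobmanager path passed as the resource manager address is rejected at runtime.
            new ResourceManagerAddress(base + "/user/jobmanager");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With separate types the compiler rejects a swapped call, and the suffix check catches a wrong string before a TaskManager ever tries to connect to it. Whether the suffix check is appropriate depends on how the endpoints are named in your setup.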
|
Nice to hear :-)

Best,
tison.

Aleksandar Mastilovic <[hidden email]> wrote on Fri, Aug 23, 2019 at 2:22 AM: