Hi,

I am trying to deploy a Flink cluster (1 JM, 2 TMs) on Docker Swarm. For JobManager HA, I have started a 3-node ZooKeeper service on the same swarm network and configured Flink's ZooKeeper quorum with the ZooKeeper service instances.

The JobManager starts with the LeaderElectionService and is assigned a leader session ID too, which I can see from the following log statements (attaching only the related logs):

org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(1f3b2ec6-77b6-4532-928f-ad8befd5202f).
Trying to associate with JobManager leader akka.tcp://flink@jobmanager:6123/user/jobmanager
Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#590681231] - leader session 1f3b2ec6-77b6-4532-928f-ad8befd5202f

But the TaskManagers are not able to register with the JobManager and give the following error:

Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,RegisterTaskManager(4fc8aceeae1e27e42b9f16df6c0cf5e3,4fc8aceeae1e27e42b9f16df6c0cf5e3 @ a118cdf39114 (dataPort=43017),cores=1, physMem=1044111360, heap=536870912, managed=324208384,1)) because the expected leader session ID 1f3b2ec6-77b6-4532-928f-ad8befd5202f did not equal the received leader session ID 00000000-0000-0000-0000-000000000000.

It seems the ResourceManager was not able to retrieve the leader session ID and passed the all-zero ID instead.

One interesting thing I observed was this ZK version log:

The version of ZooKeeper being used doesn't support Container nodes. CreateMode.PERSISTENT will be used instead.

Is this a ZK version problem? Should I be using ZK 3.4.6?

My configuration:
Flink version: 1.4.0
ZK version: 3.4.11 (I just pulled the latest image)

Thanks in advance.
Chirag
|
Do you see in the logs whether the TaskManagers connect to ZooKeeper correctly as well? They need this in order to find the JobManager leader.
Best, Aljoscha
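
Besides reading the logs, one rough way to rule out a plain network problem is to probe each ZooKeeper server from inside a TaskManager container. The sketch below is only an illustration: it sends ZooKeeper's "ruok" four-letter command over a raw socket and checks for the "imok" reply. The function name zk_is_ok is made up for this example, it assumes Python is available in the container, and it assumes four-letter commands are enabled on the servers (newer ZooKeeper releases restrict them via a whitelist).

```python
import socket


def zk_is_ok(host, port=2181, timeout=3.0):
    """Return True if the ZooKeeper server at host:port answers 'ruok' with 'imok'.

    Hypothetical helper for illustration; it only tests reachability and basic
    server health, not whether Flink itself is configured to use ZooKeeper.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            # The server replies and then closes the connection,
            # so read until recv() returns an empty bytes object.
            while True:
                data = sock.recv(64)
                if not data:
                    break
                chunks.append(data)
            return b"".join(chunks) == b"imok"
    except OSError:
        # Covers refused connections, DNS failures and timeouts.
        return False
```

Calling zk_is_ok("zookeeper1") from a TaskManager container (the hostname here is a placeholder for your own ZooKeeper service name) returning False would point at networking or service discovery rather than at Flink. A True result only shows the server is reachable; the TaskManager still needs the HA settings in its own configuration.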
|
Thanks Aljoscha.
I haven't checked that bit. Is there any configuration for TaskManagers to find ZK? Regards, Chirag
|
It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode
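
For illustration, a minimal sketch of what the HA section of flink-conf.yaml might look like when it is identical on the JobManager and on every TaskManager. The keys are the ones listed on the linked page; the ZooKeeper hostnames, the storage path and the root znode are placeholders and need to be replaced with your own values.

```yaml
# HA section of flink-conf.yaml; the same values go on the JobManager and all TaskManagers.
# Hostnames and paths below are placeholders, not values from this thread.
high-availability: zookeeper
high-availability.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
# Directory on a file system reachable from all nodes (e.g. HDFS) for HA metadata.
high-availability.storageDir: hdfs:///flink/ha/
# Root znode under which Flink creates its ZooKeeper nodes.
high-availability.zookeeper.path.root: /flink
```

If the TaskManager configuration lacks the first two entries, the TaskManagers have no way to look up the current leader in ZooKeeper.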
|
Thanks a lot Aljoscha. I was making a silly mistake. The TaskManagers can now register with the JobManager. One more thing: does Flink now store JobGraphs in ZK too? Regards, Chirag
On Wednesday, 14 February, 2018, 8:06:14 PM IST, Aljoscha Krettek <[hidden email]> wrote:
It should be roughly the same settings that you use in your JobManager. They are described here: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#zookeeper-based-ha-mode
|
Hi,
AFAIK, the JobGraph itself is not stored in ZK but in HDFS. ZK only stores a handle to the serialised JobGraph. Best, Aljoscha
|