Hi Yang,
I deployed Flink with three JobManager replicas to test a faster job recovery scenario. Below is my deployment:
$ kubectl get po -namit | grep zk
eric-data-coordinator-zk-0 1/1 Running 0 6d21h
eric-data-coordinator-zk-1 1/1 Running 0 6d21h
eric-data-coordinator-zk-2 1/1 Running 0 6d21h
flink-jobmanager-ha-zk-1-5d58dc469-8bjpb 1/1 Running 0 19h
flink-jobmanager-ha-zk-1-5d58dc469-klg5p 1/1 Running 0 19h
flink-jobmanager-ha-zk-1-5d58dc469-kvwzk 1/1 Running 0 19h
$ kubectl get svc -namit | grep zk
flink-jobmanager-ha-rest-zk1 NodePort 10.100.118.186 <none> 8081:32115/TCP 21h
flink-jobmanager-ha-zk1 ClusterIP 10.111.135.174 <none> 6123/TCP,6124/TCP,8081/TCP 21h
eric-data-coordinator-zk ClusterIP 10.105.139.167 <none> 2181/TCP,8080/TCP,21007/TCP 7d20h
eric-data-coordinator-zk-ensemble-service ClusterIP None <none> 2888/TCP,3888/TCP 7d20h
Flink Configmap:
====================
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config-ha-zk-1
  namespace: amit
  labels:
    app: flink
data:
  flink-conf.yaml: |+
    jobmanager.rpc.address: flink-jobmanager-ha-zk1
    taskmanager.numberOfTaskSlots: 2
    blob.server.port: 6124
    jobmanager.rpc.port: 6123
    taskmanager.rpc.port: 6122
    queryable-state.proxy.ports: 6125
    jobmanager.memory.process.size: 1600m
    taskmanager.memory.process.size: 1728m
    parallelism.default: 2
    # High Availability parameters
    high-availability: zookeeper
    high-availability.cluster-id: /haclusterzk1
    high-availability.storageDir: file:///opt/flink/recovery/
    high-availability.zookeeper.path.root: /flinkhazk
    high-availability.zookeeper.quorum: eric-data-coordinator-zk:2181
    high-availability.jobmanager.port: 6123
===============================================================
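For reference, this is a minimal sketch of how I understand the settings above map to the ZooKeeper paths I inspect further down. The root and cluster-id values are taken from my ConfigMap; the /leaderlatch and /leader segments are my assumption about the defaults of high-availability.zookeeper.path.latch and high-availability.zookeeper.path.leader, and zk_path is just a helper name I made up for illustration.

```python
def zk_path(*parts):
    """Join ZooKeeper path segments, tolerating missing or duplicate slashes."""
    segments = [p.strip("/") for p in parts if p.strip("/")]
    return "/" + "/".join(segments)


root = "/flinkhazk"           # high-availability.zookeeper.path.root (from my ConfigMap)
cluster_id = "/haclusterzk1"  # high-availability.cluster-id (from my ConfigMap)
latch_dir = "/leaderlatch"    # assumed default of high-availability.zookeeper.path.latch
leader_dir = "/leader"        # assumed default of high-availability.zookeeper.path.leader

# Paths for the dispatcher component; other components use their own lock names.
dispatcher_latch = zk_path(root, cluster_id, latch_dir, "dispatcher_lock")
dispatcher_leader = zk_path(root, cluster_id, leader_dir, "dispatcher_lock")

print(dispatcher_latch)
print(dispatcher_leader)
```

If the assumed defaults are right, the first printed path is the znode I list in zkCli below.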
In one of the three JobManager pods I am getting this error:
2021-01-19 08:18:33,982 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService [] - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
2021-01-19 08:21:39,381 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager-ha-zk1:6123/user/rpc/dispatcher_1.
2021-01-19 08:21:42,521 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager-ha-zk1:6123/user/rpc/dispatcher_1.
2021-01-19 08:21:45,508 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager-ha-zk1:6123/user/rpc/dispatcher_1.
2021-01-19 08:21:46,369 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager-ha-zk1:6123/user/rpc/dispatcher_1.
2021-01-19 08:22:13,658 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager-ha-zk1:6123/user/rpc/dispatcher_1.
2021-01-20 04:10:39,836 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever [] - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager-ha-zk1:6123/user/rpc/dispatcher_1.
And when I try to access the GUI, I get the below error:
In ZooKeeper I can see that all three IDs are there:
[zk: localhost:2181(CONNECTED) 5] ls /flinkhazk/haclusterzk1/leaderlatch/dispatcher_lock
[_c_1d5fc8b1-063f-4a1c-ad0f-ec46b6f10f36-latch-0000000020, _c_229d0739-8854-4a5a-ace7-377d9edc575f-latch-0000000018, _c_4eac3aaf-3f0f-4297-ac7f-086821548697-latch-0000000019]
[zk: localhost:2181(CONNECTED) 6]
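As far as I understand the Curator LeaderLatch recipe, each participant creates an ephemeral sequential child node and the one with the lowest sequence suffix holds the latch. Here is a small sketch, under that assumption, that picks that node from the listing above. The function names are my own, and it only identifies the latch node, not which pod created it.

```python
import re


def latch_sequence(node_name):
    """Extract the numeric sequence suffix from a latch child node name."""
    match = re.search(r"-latch-(\d+)$", node_name)
    if match is None:
        raise ValueError(f"not a latch node name: {node_name!r}")
    return int(match.group(1))


def current_latch_holder(children):
    """Return the child with the lowest sequence number (assumed to be the leader)."""
    return min(children, key=latch_sequence)


# Children copied from the zkCli output above.
children = [
    "_c_1d5fc8b1-063f-4a1c-ad0f-ec46b6f10f36-latch-0000000020",
    "_c_229d0739-8854-4a5a-ace7-377d9edc575f-latch-0000000018",
    "_c_4eac3aaf-3f0f-4297-ac7f-086821548697-latch-0000000019",
]

print(current_latch_holder(children))
```

For my listing this selects the node ending in -latch-0000000018, but I still do not know how to map that node to one of the three pods.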
So I have the following questions:
1) What is the correct way to start three JobManager replicas with ZooKeeper? Is there any link that explains this deployment scenario and configuration?
2) How can we identify which of the three JobManager replicas is the leader?
Regards,
Amit Bhatia