Hi All,
I'm running Flink 1.13.0 as a Session cluster deployed with a StatefulSet + PVs and HA configured within a Kubernetes cluster. I have submitted jobs to it and it all works fine. Most of my jobs are long-running, typically consuming data from Kafka. I have noticed that after some time all my JobManagers have restarted multiple times and can no longer recover. These are some of the logs I have seen across the JobManager instances.

This one doesn't seem harmful, right? It just means multiple JMs are trying to edit the ConfigMap at the same time to become the leader and it's locked? It is the only one marked as an ERROR:

```
2021-05-24 23:07:30,962 ERROR io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector - Exception occurred while acquiring lock 'ConfigMapLock: ns - foo-restserver-leader (4a786be1-80e0-4fae-bf75-2dafc5f7526b)'
```

I have seen the next one multiple times, and it seems to be an issue with the Java version and the OkHttp version (https://github.com/fabric8io/kubernetes-client/pull/2176). We are using JDK 11:

```
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [ConfigMap] with name: [foo-restserver-leader] in namespace: [foo] failed.
```

I have created Roles for RBAC in the cluster, so there shouldn't be an issue with permissions:

```
rules:
  - verbs:
      - get
      - watch
      - list
      - delete
      - create
      - update
    apiGroups:
      - ''
    resources:
      - configmaps
```

This one seems to be a timeout watching ConfigMaps?

```
2021-05-25 01:00:21,470 WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Exec Failure
java.net.SocketTimeoutException: timeout
```

There is an endless loop with the following message. I have found the line that logs it (https://github.com/apache/flink/blob/80ad5b3b511a68cce19a53291000c9936e10db17/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L395), but what does this message really mean?

```
org.apache.flink.util.SerializedThrowable: The leading JobMaster id 94293ee005832f68401020e856274c84 did not match the received JobMaster id 8287ad63d2d10239c0839abe06dd4344. This indicates that a JobMaster leader change has happened.
```

Final log before the pod fails:

```
WARN org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - The connection was unexpectedly closed by the client.
```

Could this be due to old ConfigMaps being left behind when redeploying the Flink Session cluster and the cluster trying to recover those jobs? I have also seen that in some cases the leader address in the 3 ConfigMaps (Dispatcher, RestServer, and ResourceManager) can differ - is this correct?

Would really appreciate any feedback!

Many thanks,
Enrique
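P.S. For context, the HA-related part of our flink-conf.yaml follows the standard Kubernetes HA setup and looks roughly like the sketch below. The values are placeholders rather than the exact ones we use, apart from the cluster-id, which matches the foo-*-leader ConfigMap names in the logs:

```
# Rough sketch of the Kubernetes HA settings assumed in this setup (placeholder values).
kubernetes.cluster-id: foo            # prefix of the foo-*-leader ConfigMaps seen above
kubernetes.namespace: foo             # placeholder
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///flink-ha   # placeholder path (e.g. a PV mount or object store)
```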
To add to my post: instead of using the pod IP for the `jobmanager.rpc.address` configuration, we start each JM pod with the fully qualified name `--host <pod-name>.<stateful-set-name>.ns.svc:8081`, and this address gets persisted to the ConfigMaps. In some scenarios, the leader address in the ConfigMaps might differ. For example, let's assume I have 3 JMs:

jm-0.jm-statefulset.ns.svc:8081  <-- Leader
jm-1.jm-statefulset.ns.svc:8081
jm-2.jm-statefulset.ns.svc:8081

I have seen the ConfigMaps in the following state:

RestServer ConfigMap address: jm-0.jm-statefulset.ns.svc:8081
Dispatcher ConfigMap address: jm-1.jm-statefulset.ns.svc:8081
ResourceManager ConfigMap address: jm-0.jm-statefulset.ns.svc:8081

Is this the correct behaviour? I have then seen that the TM pods fail to connect due to:

```
java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b870874c1c590d593178811f052a42c9, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://[hidden email]:6123/user/rpc/resourcemanager_0 because the fencing token is null.
```

This is explained by Till in https://issues.apache.org/jira/browse/FLINK-18367?focusedCommentId=17141070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17141070

Has anyone else seen this?

Thanks!
Enrique
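P.S. In case the setup detail matters: the FQDN is derived per pod via the downward API, roughly as sketched below. The container and service names are the illustrative ones from my example above, and the exact argument plumbing depends on the image entrypoint, so treat it as a sketch rather than our exact manifest:

```
# Sketch: each JM pod derives its own FQDN from the downward API and the headless service.
# "jm-statefulset" and "ns" are the illustrative names used above.
containers:
  - name: jobmanager
    image: flink:1.13.0
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
    args:
      - jobmanager
      - "--host"
      - "$(POD_NAME).jm-statefulset.ns.svc:8081"   # Kubernetes expands $(POD_NAME) in args
```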
Hi Enrique,

Thanks for reaching out to the community. I'm not 100% sure what problem you're facing. The log messages you're sharing could mean that the Flink cluster is still behaving normally, with some outages and the HA functionality kicking in. The behaviour you're seeing with the leaders of the different actors (i.e. RestServer, Dispatcher, ResourceManager) being located on different hosts is also fine and not an indication of something going wrong. It might help to share the entire logs with us if you need assistance in investigating your issue.

Best,
Matthias

On Thu, May 27, 2021 at 12:42 PM Enrique <[hidden email]> wrote:
> To add to my post, instead of using POD IP for the `jobmanager.rpc.address`
When the API server or etcd of your K8s cluster is under heavy load, the fabric8 Kubernetes client might get a timeout when watching/renewing/getting the ConfigMap. You could try increasing the read/connect timeout of the HTTP client (the default is 10s):

env.java.opts: "-Dkubernetes.connection.timeout=30000 -Dkubernetes.request.timeout=30000"

as well as the leader election timeouts:

high-availability.kubernetes.leader-election.lease-duration: 30s
high-availability.kubernetes.leader-election.renew-deadline: 30s

After you apply these configurations, the Flink cluster should be able to tolerate a "not-very-good" network environment.

Moreover, if you could share the failed JobManager logs, it would be easier for the community to debug the issue.

Best,
Yang

Matthias Pohl <[hidden email]> wrote on Fri, May 28, 2021 at 11:37 PM:
|