JM cannot recover with Kubernetes HA

JM cannot recover with Kubernetes HA

Enrique
Hi All,

Flink 1.13.0

I have a Session cluster deployed in a Kubernetes cluster as a StatefulSet
with PVs and Kubernetes HA configured. I have submitted jobs to it, and it
all works fine. Most of my jobs are long-running, typically consuming data
from Kafka.
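
For context, the HA part of my flink-conf.yaml looks roughly like this (a
sketch: the storage directory path is illustrative and points at the
mounted PV):
```
kubernetes.cluster-id: foo
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///flink/ha   # illustrative path on the mounted PV
```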

I have noticed that after some time all my JobManagers have restarted
multiple times and can no longer recover.

These are some of the logs I have seen in the multiple JobManager instances:

This doesn't seem harmful, right? It just means multiple JMs are trying to
update the ConfigMap at the same time to become leader and the lock is
currently held? This is the only message marked as an ERROR:
```
2021-05-24 23:07:30,962 ERROR
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector -
Exception occurred while acquiring lock 'ConfigMapLock: ns -
foo-restserver-leader (4a786be1-80e0-4fae-bf75-2dafc5f7526b)'
```



I have seen this one multiple times; it seems to be an issue with the Java
version and the OkHttp version
(https://github.com/fabric8io/kubernetes-client/pull/2176). We are using
JDK 11:
```
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]
for kind: [ConfigMap]  with name: [foo-restserver-leader]  in namespace:
[foo]  failed.
```



I have created RBAC Roles in the cluster, so there shouldn't be an issue
with permissions:
```
rules:
  - verbs:
      - get
      - watch
      - list
      - delete
      - create
      - update
    apiGroups:
      - ''
    resources:
      - configmaps
```
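
For completeness, the Role is bound to the service account the JM pods run
as with a RoleBinding along these lines (all the names here are
illustrative):
```
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: flink-configmap-access      # illustrative name
  namespace: foo
subjects:
  - kind: ServiceAccount
    name: flink-service-account     # whatever service account the JM pods run as
    namespace: foo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: flink-configmap-role        # the Role holding the rules above
```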

This one seems to be a timeout while watching ConfigMaps:
```
2021-05-25 01:00:21,470 WARN
io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Exec
Failure
java.net.SocketTimeoutException: timeout
```

The JM then loops endlessly with the following message. I have found the
line that produces it
(https://github.com/apache/flink/blob/80ad5b3b511a68cce19a53291000c9936e10db17/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L395),
but what does this message really mean?
```
org.apache.flink.util.SerializedThrowable: The leading JobMaster id
94293ee005832f68401020e856274c84 did not match the received JobMaster id
8287ad63d2d10239c0839abe06dd4344. This indicates that a JobMaster leader
change has happened.
```

Final log before the pod fails:
```
WARN  org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - The
connection was unexpectedly closed by the client.
```

Could this be due to old ConfigMaps lingering from before redeploying the
Flink Session cluster, with the new cluster trying to recover those jobs?

I have also seen that in some cases the leader address in the 3 ConfigMaps
(Dispatcher, RestServer, and ResourceManager) can differ - is this correct?

Would really appreciate any feedback!

Many thanks

Enrique



Re: JM cannot recover with Kubernetes HA

Enrique
To add to my post: instead of using the pod IP for the
`jobmanager.rpc.address` configuration, we start each JM pod with the fully
qualified name `--host <pod-name>.<stateful-set-name>.ns.svc:8081`, and this
address gets persisted to the ConfigMaps. In some scenarios, the leader
address in the ConfigMaps might differ.
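
Concretely, the JM container spec looks roughly like this (a sketch; the
image and everything apart from the `--host` flag are illustrative):
```
containers:
  - name: jobmanager
    image: flink:1.13.0       # illustrative
    args:
      - jobmanager
      - --host
      - $(POD_NAME).jm-statefulset.ns.svc:8081   # the pod's stable DNS name
    env:
      - name: POD_NAME        # expose the pod name so args can reference it
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
```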

For example, let's assume I have 3 JMs:

jm-0.jm-statefulset.ns.svc:8081 <-- Leader
jm-1.jm-statefulset.ns.svc:8081
jm-2.jm-statefulset.ns.svc:8081

I have seen the ConfigMaps in the following state:

RestServer ConfigMap address: jm-0.jm-statefulset.ns.svc:8081
Dispatcher ConfigMap address: jm-1.jm-statefulset.ns.svc:8081
ResourceManager ConfigMap address: jm-0.jm-statefulset.ns.svc:8081

Is this the correct behaviour?

I have then seen the TM pods fail to connect due to:

```
java.util.concurrent.CompletionException:
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
not set: Ignoring message
RemoteFencedMessage(b870874c1c590d593178811f052a42c9,
RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time)))
sent to
akka.tcp://[hidden email]:6123/user/rpc/resourcemanager_0
because the fencing token is null.
```

This is explained by Till in
https://issues.apache.org/jira/browse/FLINK-18367?focusedCommentId=17141070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17141070

Has anyone else seen this?

Thanks!

Enrique




Re: JM cannot recover with Kubernetes HA

Matthias
Hi Enrique,
thanks for reaching out to the community. I'm not 100% sure what problem you're facing. The log messages you're sharing could simply mean that the Flink cluster is behaving normally, with occasional outages and the HA functionality kicking in.

The behavior you're seeing, with the leaders for the different actors (i.e. RestServer, Dispatcher, ResourceManager) being located on different hosts, is fine and not an indication of something going wrong either.

It might help to share the entire logs with us if you need assistance in investigating your issue.

Best,
Matthias


Re: JM cannot recover with Kubernetes HA

Yang Wang
When the APIServer or etcd of your K8s cluster is under heavy load, the fabric8 Kubernetes client might get a timeout when watching/renewing/getting the ConfigMap.

I think you could increase the read/connect timeout of the HTTP client (the default is 10s) and give it a try:
```
env.java.opts: "-Dkubernetes.connection.timeout=30000 -Dkubernetes.request.timeout=30000"
```

The same goes for the leader election timeouts:
```
high-availability.kubernetes.leader-election.lease-duration: 30s
high-availability.kubernetes.leader-election.renew-deadline: 30s
```

After you apply these configurations, I think the Flink cluster should be able to tolerate a "not-very-good" network environment.


Moreover, if you could share the failed JobManager logs, it would be easier for the community to debug the issue.


Best,
Yang
