Native K8S HA Session Cluster Issue 1.12.1

kb
Hi All,

On long-lived session clusters we are seeing a k8s error: `Error while watching the ConfigMap`.
The good news is that the `too old resource version` issue looks like it is fixed :).

Logs are attached below. Any tips?

best
Kevin


2021-02-11 07:55:15,249 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 4 for job 58ec7a029cd31ad057e25479a9979cb4 (202852094 bytes in 49274 ms).
2021-02-11 08:00:15,732 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 5 (type=CHECKPOINT) @ 1613030415249 for job 58ec7a029cd31ad057e25479a9979cb4.
2021-02-11 08:00:25,446 ERROR org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Fatal error occurred in ResourceManager.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-02-11 08:00:25,456 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-02-11 08:00:25,487 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124

Re: Native K8S HA Session Cluster Issue 1.12.1

Till Rohrmann
Hi Kevin,

Unfortunately, the logs do not contain the root cause of the error. I can only guess, but it could indeed be FLINK-20417 [1]. If so, the problem should be fixed in the upcoming Flink 1.12.2 release, which will hopefully be out next week. If it turns out to be a different problem, we will know more then, because Flink 1.12.2 also fixes the swallowing of the root cause. I would therefore highly recommend upgrading once the next bug-fix release is out.
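
To illustrate what "swallowing the root cause" means in general, here is a minimal, self-contained sketch. It is not Flink's actual code: `WatchFailedException`, `withoutCause` and `withCause` are hypothetical names made up for this example. It only shows why a wrapper exception created without its cause produces a stack trace like the one above, with no "Caused by:" section.

```java
import java.io.IOException;

public class RootCauseDemo {
    /** Hypothetical wrapper exception; a stand-in, not Flink's LeaderRetrievalException. */
    static class WatchFailedException extends Exception {
        WatchFailedException(String message) {
            super(message);
        }

        WatchFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Ignores the underlying failure, so the trace shows only the wrapper message. */
    static WatchFailedException withoutCause(Throwable underlying) {
        return new WatchFailedException("Error while watching the ConfigMap");
    }

    /** Chains the underlying failure, so the trace ends with a "Caused by:" section. */
    static WatchFailedException withCause(Throwable underlying) {
        return new WatchFailedException("Error while watching the ConfigMap", underlying);
    }

    public static void main(String[] args) {
        // Hypothetical underlying failure, for illustration only.
        Throwable underlying = new IOException("hypothetical underlying failure");

        withoutCause(underlying).printStackTrace(); // no "Caused by:" section
        withCause(underlying).printStackTrace();    // includes "Caused by: java.io.IOException"
    }
}
```

With the second variant, the log itself would tell us whether the watch was closed for the reason described in FLINK-20417 or for some other reason.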


Cheers,
Till

On Thu, Feb 11, 2021 at 9:21 AM Bohinski, Kevin <[hidden email]> wrote:

Re: Native K8S HA Session Cluster Issue 1.12.1

Yang Wang
I second Till's suggestion.

You could also build your own flink-kubernetes jar from the source code of the 1.12 branch. After that, bundle the
flink-kubernetes jar into the image under the /opt/flink/lib directory and push the image to your Docker repository.
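
If you go this route, it can be worth checking which jar a class is actually loaded from, since the stack trace above shows the Kubernetes classes coming from flink-dist. The following is only a rough sketch under that assumption: `WhichJar` and `locationOf` are hypothetical names, and the class name passed in `main` is the one from the stack trace in this thread. Run it with the same classpath as the JobManager; outside of Flink it simply reports that the class is not on the classpath.

```java
import java.security.CodeSource;

public class WhichJar {
    /** Returns the location a class was loaded from, or a placeholder if unknown. */
    static String locationOf(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        return source == null
                ? "(no code source, e.g. a JDK class)"
                : String.valueOf(source.getLocation());
    }

    /** Looks a class up by name without initializing it; reports a missing class instead of throwing. */
    static String locationOf(String className) {
        try {
            return locationOf(Class.forName(className, false, WhichJar.class.getClassLoader()));
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace in this thread.
        System.out.println(locationOf(
                "org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher"));
    }
}
```

If the printed location still points at the flink-dist jar, the rebuilt flink-kubernetes jar is not the one being used.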

Some users ran into the same issue as you and have verified that the "too old resource version" fix works well for them.


Best,
Yang

On Fri, Feb 12, 2021 at 1:20 AM Till Rohrmann <[hidden email]> wrote: