Hi All,
On long-lived session clusters we are seeing a k8s error: `Error while watching the ConfigMap`. The good news is that the `too old resource version` issue looks fixed :). Logs are attached below. Any tips?

Best,
Kevin

2021-02-11 07:55:15,249 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 4 for job 58ec7a029cd31ad057e25479a9979cb4 (202852094 bytes in 49274 ms).
2021-02-11 08:00:15,732 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 5 (type=CHECKPOINT) @ 1613030415249 for job 58ec7a029cd31ad057e25479a9979cb4.
2021-02-11 08:00:25,446 ERROR org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Fatal error occurred in ResourceManager.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-02-11 08:00:25,456 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-02-11 08:00:25,487 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
Hi Kevin,

Unfortunately, the log does not show the root cause of the error. I can only guess, but it could indeed be FLINK-20417 [1]. If so, the problem should be fixed in the upcoming Flink 1.12.2 release, which will hopefully be out next week. If it turns out to be a different problem, we will know more after the upgrade, because Flink 1.12.2 also fixes the problem of the root cause being swallowed. So I highly recommend upgrading once the next bug-fix release is out.

Cheers,
Till

On Thu, Feb 11, 2021 at 9:21 AM Bohinski, Kevin <[hidden email]> wrote:
> Hi All,
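A note on Till's remark about the swallowed root cause: the stack traces in the original post end without any "Caused by:" section, which is what a Java log typically looks like when an exception is created without its underlying cause. The sketch below only illustrates that general pattern; `WatchException`, `swallow`, and `chain` are hypothetical names and are not taken from Flink or the fabric8 client.

```java
public class RootCauseDemo {

    // Hypothetical stand-in for an error raised by a watch callback.
    // This is NOT Flink's LeaderRetrievalException.
    public static class WatchException extends Exception {
        public WatchException(String message) {
            super(message);
        }

        public WatchException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Drops the underlying failure: a logged stack trace shows only the
    // generic message and no "Caused by:" section.
    public static WatchException swallow(Throwable cause) {
        return new WatchException("Error while watching the ConfigMap");
    }

    // Chains the underlying failure so it is printed as "Caused by: ...".
    public static WatchException chain(Throwable cause) {
        return new WatchException("Error while watching the ConfigMap", cause);
    }

    public static void main(String[] args) {
        // Placeholder failure, used only for the demonstration.
        Throwable underlying = new IllegalStateException("hypothetical underlying failure");

        System.out.println(swallow(underlying).getCause());            // prints "null"
        System.out.println(chain(underlying).getCause().getMessage()); // prints the placeholder message
    }
}
```

With the chained variant, the log would carry the underlying failure, which is the information needed to tell FLINK-20417 apart from a different problem.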
I second Till's suggestion. You could also build your own flink-kubernetes jar from the source code of the 1.12 branch, bundle that jar into the image under the /opt/flink/lib directory, and push the image to your Docker repository. Some users have run into the same issue as you and have verified that the "too old resource version" fix works well for them.

Best,
Yang

On Fri, Feb 12, 2021 at 1:20 AM Till Rohrmann <[hidden email]> wrote: