Cannot start from savepoint using Flink 1.12 in standalone Kubernetes + Kubernetes HA


Cannot start from savepoint using Flink 1.12 in standalone Kubernetes + Kubernetes HA

ChangZhuo Chen (陳昌倬)
Hi,

We cannot start a job from a savepoint (created with Flink 1.12,
standalone Kubernetes + ZooKeeper HA) on Flink 1.12, standalone
Kubernetes + Kubernetes HA. The following exception stops the job:

    Caused by: java.util.concurrent.CompletionException: org.apache.flink.kubernetes.kubeclient.resources.KubernetesException: Cannot retry checkAndUpdateConfigMap with configMap name-51e5afd90227d537ff442403d1b279da-jobmanager-leader because it does not exist.
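
In case it helps with diagnosis, the jobmanager-leader ConfigMaps that
Kubernetes HA creates can be listed with kubectl (the namespace is the
one from our config below; the grep pattern is the suffix from the
exception):

    kubectl get configmaps -n kubernetes-namespace | grep jobmanager-leader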


The cluster can start a new job from scratch, so we think the cluster
configuration is good.

The following is the HA-related config:

    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: gs://some/path/recovery
    kubernetes.cluster-id: cluster-name
    kubernetes.context: kubernetes-context
    kubernetes.namespace: kubernetes-namespace
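
For completeness, we pass the savepoint to the standalone job entrypoint
through the JobManager container args, roughly as below (the job class
name and savepoint path here are placeholders):

    args: ["standalone-job",
           "--job-classname", "com.example.MyJob",
           "--fromSavepoint", "gs://some/path/savepoints/savepoint-xxxxxx"]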


--
ChangZhuo Chen (陳昌倬) czchen@{czchen,debconf,debian}.org
http://czchen.info/
Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B

Re: Cannot start from savepoint using Flink 1.12 in standalone Kubernetes + Kubernetes HA

Yang Wang
This is a known issue; please refer to [1] for more information. It is already fixed on master and on the 1.12 branch,
and the upcoming bugfix release (1.12.1) will include the fix. Maybe you could help verify it then.
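
Once 1.12.1 is released, verifying should just be a matter of bumping the
image tag in your JobManager and TaskManager manifests (assuming you
deploy the official Docker images), e.g.:

    image: flink:1.12.1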


Best,
Yang
