Hi Yang,

Thanks again for all the help! We are still seeing this with 1.11.2 and ZK. It looks like others are seeing this as well, and they found a solution:
https://translate.google.com/translate?hl=en&sl=zh-CN&u=https://cloud.tencent.com/developer/article/1731416&prev=search

Should this solution be added to 1.12?

Best,
kevin

On 2020/08/14 02:48:50, Yang Wang <[hidden email]> wrote:
> Hi kevin,
>
> Thanks for sharing more information. You are right. The "too old resource
> version" error is caused by a bug in the fabric8 kubernetes-client [1]. It
> was fixed in kubernetes-client v4.6.1, and we have bumped the
> kubernetes-client version to v4.9.2 in Flink release-1.11. The fix has also
> been backported to release-1.10 and will be included in the next minor
> release (1.10.2).
>
> BTW, if you really want all your jobs to be recovered when the jobmanager
> crashes, you still need to configure Zookeeper high availability.
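For anyone stuck on an older client, the usual workaround pattern is to re-create the watch whenever it is closed with an error such as "too old resource version", instead of treating it as fatal. Below is a minimal sketch of that pattern, assuming the fabric8 4.x Watcher/Watch API; the class name, label selector, and event handling are made up for illustration, and this is not the actual Flink or kubernetes-client fix.

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;

// Illustrative only: restart the pod watch whenever it is closed abnormally,
// e.g. because the resource version it was resuming from has expired.
public class ResilientPodWatch {

    private final KubernetesClient client = new DefaultKubernetesClient();
    private volatile Watch watch;

    public void start(final String namespace, final String clusterId) {
        watch = client.pods()
                .inNamespace(namespace)
                .withLabel("app", clusterId)   // hypothetical label selector
                .watch(new Watcher<Pod>() {
                    @Override
                    public void eventReceived(Action action, Pod pod) {
                        // Hand the event to whatever bookkeeping needs it.
                        System.out.println(action + " " + pod.getMetadata().getName());
                    }

                    @Override
                    public void onClose(KubernetesClientException cause) {
                        // A non-null cause means the watch died abnormally
                        // ("too old resource version" typically surfaces here);
                        // start a fresh watch instead of failing the process.
                        if (cause != null) {
                            start(namespace, clusterId);
                        }
                    }
                });
    }
}

Flink's KubernetesResourceManager maintains its own pod watcher, so the sketch is only meant to show the shape of the recovery: an expired resource version should lead to a new watch, not a fatal error in the cluster entrypoint.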
>
> Best,
> Yang
>
> Bohinski, Kevin <[hidden email]> wrote on Friday, August 14, 2020 at 6:32 AM:
> > Might be useful
> >
> > Best,
> > kevin
> >
> > From: "Bohinski, Kevin" <[hidden email]>
> > Date: Thursday, August 13, 2020 at 6:13 PM
> > To: Yang Wang <[hidden email]>
> > Cc: "[hidden email]" <[hidden email]>
> > Subject: Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
> >
> > Hi
> >
> > Got the logs on crash, hopefully they help.
> >
> > 2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
> > io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
> >     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
> >     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
> >
> > 2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> > io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
> >     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
> >     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
> >
> > 2020-08-13 22:00:40,416 INFO  org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
> >
> > Best,
> > kevin
> >
> > From: Yang Wang <[hidden email]>
> > Date: Sunday, August 9, 2020 at 10:29 PM
> > To: "Bohinski, Kevin" <[hidden email]>
> > Cc: "[hidden email]" <[hidden email]>
> > Subject: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
> >
> > Hi Kevin,
> >
> > I think you may not have set the high-availability configuration in your
> > native K8s session. Currently, we only support Zookeeper HA, so you need
> > to add the following configuration. Once HA is configured, the running
> > jobs, checkpoints, and other metadata are persisted, so when the
> > jobmanager fails over, all the jobs can be recovered. I have tested that
> > this works properly.
> >
> > high-availability: zookeeper
> > high-availability.zookeeper.quorum: zk-client:2181
> > high-availability.storageDir: hdfs:///flink/recovery
> >
> > I know you may not have a Zookeeper cluster. You could use a Zookeeper
> > K8s operator [1] to deploy a new one.
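For a native Kubernetes session, these settings can also be passed as -D options when the session is started. A sketch, with placeholder values for the cluster id, Zookeeper address, and storage path:

./bin/kubernetes-session.sh \
    -Dkubernetes.cluster-id=my-flink-session \
    -Dhigh-availability=zookeeper \
    -Dhigh-availability.zookeeper.quorum=zk-client:2181 \
    -Dhigh-availability.storageDir=hdfs:///flink/recovery

The storage directory must be a path every jobmanager can reach (HDFS, S3, and so on), since that is where job graphs and checkpoint metadata are written for recovery.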
> >
> > Moreover, it is not very convenient to use Zookeeper for HA, so native
> > K8s HA support [2] is planned, and we are trying to finish it in the next
> > major release cycle (1.12).
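For reference, a rough sketch of how that Kubernetes-native HA is expected to be configured once it lands; the option names below come from the design discussion and may change before the 1.12 release:

high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.cluster-id: my-flink-session
high-availability.storageDir: hdfs:///flink/recovery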
> >
> > Best,
> > Yang
> >
> > Bohinski, Kevin <[hidden email]> wrote on Friday, August 7, 2020 at 11:40 PM:
> > Hi all,
> >
> > In our 1.11.1 native K8s session, after we submit a job it will run
> > successfully for a few hours and then fail when the jobmanager pod
> > restarts. The relevant logs from after the restart are attached below.
> > Any suggestions?
> >
> > Best,
> > kevin
> >
> > 2020-08-06 21:50:24,425 INFO  org.apache.flink.kubernetes.KubernetesResourceManager [] - Recovered 32 pods from previous attempts, current attempt id is 2.
> > 2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})
> > 2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16
> > 2020-08-06 21:50:24,615 INFO  org.apache.flink.kubernetes.KubernetesResourceManager [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0
> > 2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.
> > java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]
> >     at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]
> >     at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]
> >     at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]
> >     at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]
> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11 [message truncated...]

Hi kevin,

Thanks for sharing the information. I will dig into it and create a ticket if necessary.

Best,
Yang

Bohinski, Kevin <[hidden email]> wrote on Thursday, October 29, 2020 at 2:35 AM: