Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers


Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

kb

Hi Yang,

 

Thanks again for all the help!

 

We are still seeing this with 1.11.2 and ZK.

Looks like others are seeing this as well, and they found a solution: https://translate.google.com/translate?hl=en&sl=zh-CN&u=https://cloud.tencent.com/developer/article/1731416&prev=search
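
As far as I can tell, the gist of that workaround is to re-create the pod watch when it is closed with "too old resource version" (HTTP 410 Gone) instead of escalating it as a fatal error that kills the jobmanager. A rough, untested sketch against the fabric8 client (the class and method names below are only illustrative, not the actual Flink patch):

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;

import java.net.HttpURLConnection;
import java.util.Map;

public class ResilientPodWatcher implements Watcher<Pod> {

    private final KubernetesClient client;
    private final Map<String, String> labels;
    private volatile Watch watch;

    public ResilientPodWatcher(KubernetesClient client, Map<String, String> labels) {
        this.client = client;
        this.labels = labels;
    }

    public void start() {
        // (Re-)establish the watch on the TaskManager pods.
        watch = client.pods().withLabels(labels).watch(this);
    }

    @Override
    public void eventReceived(Action action, Pod pod) {
        // Forward ADDED / MODIFIED / DELETED / ERROR events to the resource manager as before.
    }

    @Override
    public void onClose(KubernetesClientException cause) {
        if (cause != null && cause.getCode() == HttpURLConnection.HTTP_GONE) {
            // "too old resource version" arrives as HTTP 410 Gone: the watch is merely
            // stale, so re-create it instead of reporting a fatal error.
            start();
            return;
        }
        // cause == null is a normal close; other causes can still be escalated as fatal.
    }
}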

 

Should this solution be added to 1.12?

 

Best

kevin

 

On 2020/08/14 02:48:50, Yang Wang <[hidden email]> wrote:

> Hi kevin,
>
> Thanks for sharing more information. You are right. Actually, "too old resource version" is caused by a bug in the fabric8 kubernetes-client[1]. It has been fixed in v4.6.1, and we have bumped the kubernetes-client version to v4.9.2 in Flink release-1.11. It has also been backported to release-1.10 and will be included in the next minor release (1.10.2).
>
> BTW, if you really want all your jobs to be recovered when the jobmanager crashes, you still need to configure Zookeeper high availability.
>
> Best,
> Yang
>
> Bohinski, Kevin <[hidden email]> wrote on Friday, August 14, 2020 at 6:32 AM:

> > Might be useful
> >
> > Best,
> > kevin
> >
> > *From:* "Bohinski, Kevin" <[hidden email]>
> > *Date:* Thursday, August 13, 2020 at 6:13 PM
> > *To:* Yang Wang <[hidden email]>
> > *Subject:* Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
> >
> > Hi
> >
> > Got the logs on crash, hopefully they help.

> > 2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
> > io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
> >     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
> >     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
> >
> > 2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> > io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
> >     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
> >     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
> >
> > 2020-08-13 22:00:40,416 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
> >
> > Best,
> > kevin

> > *From:* Yang Wang <[hidden email]>
> > *Date:* Sunday, August 9, 2020 at 10:29 PM
> > *To:* "Bohinski, Kevin" <[hidden email]>
> > *Subject:* [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
> >
> > Hi Kevin,
> >
> > I think you may not have set the high availability configuration in your native K8s session. Currently, we only support zookeeper HA, so you need to add the following configuration. After HA is configured, the running jobs, checkpoints and other metadata can be stored. When the jobmanager fails over, all the jobs can then be recovered. I have tested that it works properly.
> >
> > high-availability: zookeeper
> > high-availability.zookeeper.quorum: zk-client:2181
> > high-availability.storageDir: hdfs:///flink/recovery
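
Side note for anyone wiring this up from code rather than flink-conf.yaml: the three keys above correspond to Flink's HighAvailabilityOptions, so an equivalent client-side configuration would look roughly like this untested sketch:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HighAvailabilityOptions;

public class ZkHaConfigExample {
    // Builds a Configuration carrying the same options as the flink-conf.yaml snippet above.
    public static Configuration zkHaConfig() {
        Configuration conf = new Configuration();
        conf.setString(HighAvailabilityOptions.HA_MODE, "zookeeper");
        conf.setString(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM, "zk-client:2181");
        conf.setString(HighAvailabilityOptions.HA_STORAGE_PATH, "hdfs:///flink/recovery");
        return conf;
    }
}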

> > I know you may not have a zookeeper cluster. You could use a zookeeper K8s operator[1] to deploy a new one.
> >
> > Moreover, it is not very convenient to use zookeeper for HA, so K8s-native HA support[2] is planned and we are trying to finish it in the next major release cycle (1.12).
> >
> > Best,
> > Yang
> >
> > Bohinski, Kevin <[hidden email]> wrote on Friday, August 7, 2020 at 11:40 PM:

> > Hi all,
> >
> > In our 1.11.1 native k8s session, after we submit a job it will run successfully for a few hours and then fail when the jobmanager pod restarts.
> >
> > Relevant logs after restart are attached below. Any suggestions?
> >
> > Best
> > kevin

> > 2020-08-06 21:50:24,425 INFO org.apache.flink.kubernetes.KubernetesResourceManager [] - Recovered 32 pods from previous attempts, current attempt id is 2.
> > 2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})
> > 2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16
> > 2020-08-06 21:50:24,615 INFO org.apache.flink.kubernetes.KubernetesResourceManager [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0
> > 2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.
> > java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]
> >     at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]
> >     at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]
> >     at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]
> >     at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]
> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
> >     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11

[message truncated...]


Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

Yang Wang
Hi kevin,

Thanks for sharing the information. I will dig into it and create a ticket if necessary.

Best,
Yang

Bohinski, Kevin <[hidden email]> wrote on Thursday, October 29, 2020 at 2:35 AM:

> Hi Yang,
>
> Thanks again for all the help!
>
> We are still seeing this with 1.11.2 and ZK.
> Looks like others are seeing this as well, and they found a solution: https://translate.google.com/translate?hl=en&sl=zh-CN&u=https://cloud.tencent.com/developer/article/1731416&prev=search
>
> Should this solution be added to 1.12?
>
> Best
> kevin

 
