Native K8S Jobmanager restarts and job never recovers

classic Classic list List threaded Threaded
5 messages Options
kb
Reply | Threaded
Open this post in threaded view
|

Native K8S Jobmanager restarts and job never recovers

kb

Hi all,

 

In our 1.11.1 native k8s session after we submit a job it will run successfully for a few hours then fail when the jobmanager pod restarts.

 

Relevant logs after restart are attached below. Any suggestions?

 

Best

kevin

 

2020-08-06 21:50:24,425 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Recovered 32 pods from previous attempts, current attempt id is 2.

2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})

2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16

2020-08-06 21:50:24,615 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0

2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.

java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]

                at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]

                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]

                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]

                at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]

Caused by: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.loadExecutionGraph(FileArchivedExecutionGraphStore.java:239) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.access$000(FileArchivedExecutionGraphStore.java:63) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:118) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:115) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                ... 37 more

2020-08-06 21:50:55,824 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignoring outdated TaskExecutorGateway connection for REDACTED-flink-session-taskmanager-1-16.

2020-08-06 21:50:56,519 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Registering TaskManager with ResourceID REDACTED-flink-session-taskmanager-1-16 (akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0) at ResourceManager

2020-08-06 21:50:56,613 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Registering TaskManager REDACTED-flink-session-taskmanager-1-16 under 89c651cee3d3f4f92b1c9472a9ec4507 at the SlotManager.

2020-08-06 21:50:55,823 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0. Returning a org.apache.flink.runtime.taskexecutor.TaskExecutorGateway gateway.

Reply | Threaded
Open this post in threaded view
|

Re: Native K8S Jobmanager restarts and job never recovers

Yang Wang
Hi Kevin,

I think you may not set the high availability configurations in your native K8s session. Currently, we only
support zookeeper HA, so you need to add the following configuration. After the HA is configured, the
running job, checkpoint and other meta could be stored. When the jobmanager failover, all the jobs
could be recovered then. I have tested it could work properly.

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-client:2181
high-availability.storageDir: hdfs:///flink/recovery

I know you may not have a zookeeper cluster.You could a zookeeper K8s operator[1] to deploy a new one.

More over, it is not very convenient to use zookeeper as HA. So a K8s native HA support[2] is in plan and we
are trying to finish it in the next major release cycle(1.12).




Best,
Yang

Bohinski, Kevin <[hidden email]> 于2020年8月7日周五 下午11:40写道:

Hi all,

 

In our 1.11.1 native k8s session after we submit a job it will run successfully for a few hours then fail when the jobmanager pod restarts.

 

Relevant logs after restart are attached below. Any suggestions?

 

Best

kevin

 

2020-08-06 21:50:24,425 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Recovered 32 pods from previous attempts, current attempt id is 2.

2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})

2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16

2020-08-06 21:50:24,615 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0

2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.

java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]

                at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]

                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]

                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]

                at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]

Caused by: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.loadExecutionGraph(FileArchivedExecutionGraphStore.java:239) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.access$000(FileArchivedExecutionGraphStore.java:63) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:118) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:115) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                ... 37 more

2020-08-06 21:50:55,824 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignoring outdated TaskExecutorGateway connection for REDACTED-flink-session-taskmanager-1-16.

2020-08-06 21:50:56,519 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Registering TaskManager with ResourceID REDACTED-flink-session-taskmanager-1-16 (akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0) at ResourceManager

2020-08-06 21:50:56,613 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Registering TaskManager REDACTED-flink-session-taskmanager-1-16 under 89c651cee3d3f4f92b1c9472a9ec4507 at the SlotManager.

2020-08-06 21:50:55,823 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0. Returning a org.apache.flink.runtime.taskexecutor.TaskExecutorGateway gateway.

kb
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

kb

Hi

 

Got the logs on crash, hopefully they help.

 

2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager        [] - Fatal error occurred in ResourceManager.

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)

                at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)

                at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

2020-08-13 22:00:40,416 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124

 

Best,

kevin

 

 

From: Yang Wang <[hidden email]>
Date: Sunday, August 9, 2020 at 10:29 PM
To: "Bohinski, Kevin" <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

 

Hi Kevin,

 

I think you may not set the high availability configurations in your native K8s session. Currently, we only

support zookeeper HA, so you need to add the following configuration. After the HA is configured, the

running job, checkpoint and other meta could be stored. When the jobmanager failover, all the jobs

could be recovered then. I have tested it could work properly.

 

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-client:2181
high-availability.storageDir: hdfs:///flink/recovery

 

I know you may not have a zookeeper cluster.You could a zookeeper K8s operator[1] to deploy a new one.

 

More over, it is not very convenient to use zookeeper as HA. So a K8s native HA support[2] is in plan and we

are trying to finish it in the next major release cycle(1.12).

 

 

 

 

Best,

Yang

 

Bohinski, Kevin <[hidden email]> 202087日周五 下午11:40写道:

Hi all,

 

In our 1.11.1 native k8s session after we submit a job it will run successfully for a few hours then fail when the jobmanager pod restarts.

 

Relevant logs after restart are attached below. Any suggestions?

 

Best

kevin

 

2020-08-06 21:50:24,425 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Recovered 32 pods from previous attempts, current attempt id is 2.

2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})

2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16

2020-08-06 21:50:24,615 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0

2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.

java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]

                at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]

                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]

                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]

                at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]

Caused by: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.loadExecutionGraph(FileArchivedExecutionGraphStore.java:239) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.access$000(FileArchivedExecutionGraphStore.java:63) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:118) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:115) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                ... 37 more

2020-08-06 21:50:55,824 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignoring outdated TaskExecutorGateway connection for REDACTED-flink-session-taskmanager-1-16.

2020-08-06 21:50:56,519 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Registering TaskManager with ResourceID REDACTED-flink-session-taskmanager-1-16 (akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0) at ResourceManager

2020-08-06 21:50:56,613 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Registering TaskManager REDACTED-flink-session-taskmanager-1-16 under 89c651cee3d3f4f92b1c9472a9ec4507 at the SlotManager.

2020-08-06 21:50:55,823 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0. Returning a org.apache.flink.runtime.taskexecutor.TaskExecutorGateway gateway.

kb
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

kb

Might be useful

https://stackoverflow.com/a/61437982

 

Best,

kevin

 

 

From: "Bohinski, Kevin" <[hidden email]>
Date: Thursday, August 13, 2020 at 6:13 PM
To: Yang Wang <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

 

Hi

 

Got the logs on crash, hopefully they help.

 

2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager        [] - Fatal error occurred in ResourceManager.

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)

                at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)

                at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

2020-08-13 22:00:40,416 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124

 

Best,

kevin

 

 

From: Yang Wang <[hidden email]>
Date: Sunday, August 9, 2020 at 10:29 PM
To: "Bohinski, Kevin" <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

 

Hi Kevin,

 

I think you may not set the high availability configurations in your native K8s session. Currently, we only

support zookeeper HA, so you need to add the following configuration. After the HA is configured, the

running job, checkpoint and other meta could be stored. When the jobmanager failover, all the jobs

could be recovered then. I have tested it could work properly.

 

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-client:2181
high-availability.storageDir: hdfs:///flink/recovery

 

I know you may not have a zookeeper cluster.You could a zookeeper K8s operator[1] to deploy a new one.

 

More over, it is not very convenient to use zookeeper as HA. So a K8s native HA support[2] is in plan and we

are trying to finish it in the next major release cycle(1.12).

 

 

 

 

Best,

Yang

 

Bohinski, Kevin <[hidden email]> 202087日周五 下午11:40写道:

Hi all,

 

In our 1.11.1 native k8s session after we submit a job it will run successfully for a few hours then fail when the jobmanager pod restarts.

 

Relevant logs after restart are attached below. Any suggestions?

 

Best

kevin

 

2020-08-06 21:50:24,425 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Recovered 32 pods from previous attempts, current attempt id is 2.

2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})

2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16

2020-08-06 21:50:24,615 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0

2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.

java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]

                at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]

                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]

                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]

                at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]

Caused by: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.loadExecutionGraph(FileArchivedExecutionGraphStore.java:239) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.access$000(FileArchivedExecutionGraphStore.java:63) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:118) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:115) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                ... 37 more

2020-08-06 21:50:55,824 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignoring outdated TaskExecutorGateway connection for REDACTED-flink-session-taskmanager-1-16.

2020-08-06 21:50:56,519 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Registering TaskManager with ResourceID REDACTED-flink-session-taskmanager-1-16 (akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0) at ResourceManager

2020-08-06 21:50:56,613 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Registering TaskManager REDACTED-flink-session-taskmanager-1-16 under 89c651cee3d3f4f92b1c9472a9ec4507 at the SlotManager.

2020-08-06 21:50:55,823 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0. Returning a org.apache.flink.runtime.taskexecutor.TaskExecutorGateway gateway.

Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

Yang Wang
Hi kevin,

Thanks for sharing more information. You are right. Actually, "too old resource version" is caused by a bug
of fabric8 kubernetes-client[1]. It has been fix in v4.6.1. And we have bumped the kubernetes-client version
to v4.9.2 in Flink release-1.11. Also it has been backported to release 1.10 and will be included in the next
minor release version(1.10.2).

BTW, if you really want all your jobs recovered when jobmanager crashed, you still need to configure the Zookeeper high availability.



Best,
Yang

Bohinski, Kevin <[hidden email]> 于2020年8月14日周五 上午6:32写道:

Might be useful

https://stackoverflow.com/a/61437982

 

Best,

kevin

 

 

From: "Bohinski, Kevin" <[hidden email]>
Date: Thursday, August 13, 2020 at 6:13 PM
To: Yang Wang <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

 

Hi

 

Got the logs on crash, hopefully they help.

 

2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager        [] - Fatal error occurred in ResourceManager.

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)

                at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.

io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)

                at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]

                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]

2020-08-13 22:00:40,416 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124

 

Best,

kevin

 

 

From: Yang Wang <[hidden email]>
Date: Sunday, August 9, 2020 at 10:29 PM
To: "Bohinski, Kevin" <[hidden email]>
Cc: "[hidden email]" <[hidden email]>
Subject: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

 

Hi Kevin,

 

I think you may not set the high availability configurations in your native K8s session. Currently, we only

support zookeeper HA, so you need to add the following configuration. After the HA is configured, the

running job, checkpoint and other meta could be stored. When the jobmanager failover, all the jobs

could be recovered then. I have tested it could work properly.

 

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-client:2181
high-availability.storageDir: hdfs:///flink/recovery

 

I know you may not have a zookeeper cluster.You could a zookeeper K8s operator[1] to deploy a new one.

 

More over, it is not very convenient to use zookeeper as HA. So a K8s native HA support[2] is in plan and we

are trying to finish it in the next major release cycle(1.12).

 

 

 

 

Best,

Yang

 

Bohinski, Kevin <[hidden email]> 202087日周五 下午11:40写道:

Hi all,

 

In our 1.11.1 native k8s session after we submit a job it will run successfully for a few hours then fail when the jobmanager pod restarts.

 

Relevant logs after restart are attached below. Any suggestions?

 

Best

kevin

 

2020-08-06 21:50:24,425 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Recovered 32 pods from previous attempts, current attempt id is 2.

2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})

2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16

2020-08-06 21:50:24,615 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0

2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.

java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]

                at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]

                at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]

                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]

                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]

                at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]

                at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]

Caused by: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.loadExecutionGraph(FileArchivedExecutionGraphStore.java:239) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.access$000(FileArchivedExecutionGraphStore.java:63) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:118) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore$1.load(FileArchivedExecutionGraphStore.java:115) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[flink-dist_2.11-1.11.1.jar:1.11.1]

                ... 37 more

2020-08-06 21:50:55,824 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager        [] - Ignoring outdated TaskExecutorGateway connection for REDACTED-flink-session-taskmanager-1-16.

2020-08-06 21:50:56,519 INFO  org.apache.flink.kubernetes.KubernetesResourceManager        [] - Registering TaskManager with ResourceID REDACTED-flink-session-taskmanager-1-16 (akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0) at ResourceManager

2020-08-06 21:50:56,613 DEBUG org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Registering TaskManager REDACTED-flink-session-taskmanager-1-16 under 89c651cee3d3f4f92b1c9472a9ec4507 at the SlotManager.

2020-08-06 21:50:55,823 DEBUG org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Try to connect to remote RPC endpoint with address akka.tcp://flink@REDACTED:6122/user/rpc/taskmanager_0. Returning a org.apache.flink.runtime.taskexecutor.TaskExecutorGateway gateway.