Re: JobGraphs not cleaned up in HA mode

Re: JobGraphs not cleaned up in HA mode

seuzxc
If I clean the ZooKeeper data, it runs fine. But the next time the JobManager fails and is redeployed, the error occurs again.




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>;
Sent: Thursday, November 28, 2019, 3:05 PM
To: "曾祥才"<[hidden email]>;
Subject: Re: JobGraphs not cleaned up in HA mode

Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199"
Check why it is unable to find it.
The better approach is: clean up the ZooKeeper state, check your configuration, correct it, and restart the cluster.
Otherwise it will always pick up the corrupted state from ZooKeeper and never restart.

Regards
Bhaskar
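Bhaskar's suggestion boils down to deleting the stale job-graph pointers from ZooKeeper before restarting. As a rough illustration (the znode layout sketched here assumes Flink's usual default of `<path.root>/<cluster-id>/jobgraphs`; the values come from the configuration posted later in this thread, so verify both against your Flink version), the znode to clear can be derived from the HA settings:

```python
# Illustrative only: compute the ZooKeeper znode that holds the submitted
# job graphs, given Flink HA settings. Assumes the default layout
# <path.root>/<cluster-id>/jobgraphs; verify against your Flink version.

def jobgraph_znode(conf):
    root = "/" + conf["high-availability.zookeeper.path.root"].strip("/")
    cluster = "/" + conf["high-availability.cluster-id"].strip("/")
    return root + cluster + "/jobgraphs"

conf = {
    "high-availability.zookeeper.path.root": "/flink/risk-insight",
    "high-availability.cluster-id": "/cluster-test",
}

znode = jobgraph_znode(conf)
print(f"zkCli.sh: rmr {znode}")  # -> zkCli.sh: rmr /flink/risk-insight/cluster-test/jobgraphs
```

Alongside clearing the znode, the matching files under high-availability.storageDir (/flink/ha here) would normally be removed as well, so that no orphaned state handles remain.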

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <[hidden email]> wrote:
I made a mistake (the log above was from another cluster). The full exception log is:


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs. 
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager. 
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper. 
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint. 
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb. 
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915) 
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) 
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) 
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) 
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594) 
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) 
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) 
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) 
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) 
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) 
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) 
... 7 more 
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190) 
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755) 
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740) 
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721) 
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888) 
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73) 
... 9 more 
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory) 
at java.io.FileInputStream.open0(Native Method) 
at java.io.FileInputStream.open(FileInputStream.java:195) 

 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>;
Sent: Thursday, November 28, 2019, 2:46 PM
To: "曾祥才"<[hidden email]>;
Cc: "User-Flink"<[hidden email]>;
Subject: Re: JobGraphs not cleaned up in HA mode

Is it filesystem or Hadoop? If it's NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
It seems you configured a Hadoop state store but are pointing it at a NAS mount.

Regards
Bhaskar

 

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <[hidden email]> wrote:
/flink/checkpoints is an external persistent store (a NAS directory mounted on the JobManager).




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>;
Sent: Thursday, November 28, 2019, 2:29 PM
To: "曾祥才"<[hidden email]>;
Cc: "user"<[hidden email]>;
Subject: Re: JobGraphs not cleaned up in HA mode

The following are the mandatory conditions for running in HA mode:

a) You should have a persistent common external store that the JobManager and TaskManagers can write state to.
b) You should have a persistent external store for ZooKeeper to store the JobGraph.

ZooKeeper refers to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the JobManager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.


Regards
Bhaskar
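The two-level layout Bhaskar describes is why a lost storage directory is fatal even though ZooKeeper itself is healthy: ZooKeeper keeps only a small state handle (a pointer), while the JobGraph bytes live in the HA storage directory. A minimal simulation of that failure mode (plain Python, not Flink code; the names are made up for illustration):

```python
# Minimal simulation (not Flink code) of the two-level job-graph store:
# ZooKeeper holds a pointer (state handle), the actual JobGraph bytes
# live in a file on shared storage.

class FileNotFound(Exception):
    pass

zookeeper = {}   # znode path -> file path (the "state handle")
storage = {}     # file path -> job graph bytes (the shared store)

def submit(job_id, graph):
    path = f"/flink/ha/submittedJobGraph-{job_id}"
    storage[path] = graph
    zookeeper[f"/jobgraphs/{job_id}"] = path

def recover(job_id):
    handle = zookeeper[f"/jobgraphs/{job_id}"]  # pointer survives in ZK
    if handle not in storage:                   # but the file may be gone
        raise FileNotFound(handle)
    return storage[handle]

submit("639170a9", b"graph-bytes")
storage.clear()   # e.g. JobManager pod redeployed with a fresh, non-shared dir
try:
    recover("639170a9")
except FileNotFound as e:
    print("recovery failed:", e)  # recovery failed: /flink/ha/submittedJobGraph-639170a9
```

This mirrors the thread's symptom exactly: cleaning ZooKeeper removes the dangling pointer, so the cluster starts, but if the storage directory does not survive the next redeploy, the same mismatch reappears.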

On Thu, Nov 28, 2019 at 10:43 AM seuzxc <[hidden email]> wrote:
Hi, I have the same problem with Flink 1.9.1; is there any solution to fix it?
When k8s redeploys the JobManager, the error looks like the following (it seems ZooKeeper does not
remove the submitted job info, but the JobManager removes the file):


Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JobGraphs not cleaned up in HA mode

Vijay Bhaskar
Can you share the flink configuration once?

Regards
Bhaskar


Re: JobGraphs not cleaned up in HA mode

seuzxc
The config (/flink is the NAS directory):

jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history
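One thing worth checking in a setup like this is that every directory the HA machinery needs sits on the shared NAS mount, since whichever JobManager takes over leadership must see the same files. A small sketch of that sanity check (illustrative only; /flink as the mount point is taken from the config above):

```python
# Sketch: sanity-check that the HA/state directories in the config above all
# sit under the shared mount (/flink here), since every JobManager that may
# take over leadership must see the same files. Illustrative, not a Flink tool.

from urllib.parse import urlparse

conf = {
    "high-availability.storageDir": "/flink/ha",
    "state.checkpoints.dir": "file:///flink/checkpoints",
    "state.savepoints.dir": "file:///flink/savepoints",
    "jobmanager.archive.fs.dir": "file:///flink/archive/history",
}

MOUNT = "/flink"

def local_path(value):
    # Accept both plain paths and file:// URIs; anything else (hdfs://, s3://)
    # is not a local-filesystem path at all.
    parsed = urlparse(value)
    return parsed.path if parsed.scheme in ("", "file") else None

for key, value in conf.items():
    path = local_path(value)
    ok = path is not None and path.startswith(MOUNT + "/")
    print(f"{key}: {'ok' if ok else 'NOT on shared mount'}")
```

Note that high-availability.zookeeper.path.root and high-availability.zookeeper.path.checkpoints are znode paths inside ZooKeeper, not filesystem paths, so they are deliberately left out of the check.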
 



Re: JobGraphs not cleaned up in HA mode

seuzxc

Anyone have the same problem? Please help, thanks.




Re: JobGraphs not cleaned up in HA mode

vino yang
Hi,

Why do you not use HDFS directly?

Best,
Vino

曾祥才 <[hidden email]> 于2019年11月28日周四 下午6:48写道:

anyone have the same problem? pls help, thks



------------------ Original Message ------------------
From: "曾祥才"<[hidden email]>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "Vijay Bhaskar"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

The config (/flink is the NAS directory):

jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history
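For comparison, here is a sketch of how the HA-related entries could be written with explicit file:// schemes, so the HA storage directory is resolved unambiguously on the shared NAS mount rather than through the default filesystem. This is illustrative only, assuming /flink is mounted at the same path on every JobManager and TaskManager node:

```yaml
# Sketch only: HA paths spelled as explicit file:// URIs on the shared
# NAS mount (assumes /flink is mounted identically on every node).
high-availability: zookeeper
high-availability.cluster-id: cluster-test
high-availability.storageDir: file:///flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.zookeeper.path.root: /flink/risk-insight
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
```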
 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 3:12 PM
To: "曾祥才"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Can you share the flink configuration once?

Regards
Bhaskar

On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <[hidden email]> wrote:
If I clean the ZooKeeper data, it runs fine. But the next time the JobManager fails and is redeployed, the error occurs again.




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 3:05 PM
To: "曾祥才"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Again, it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199"
Check why it is unable to find it.
The better approach is to clean up the ZooKeeper state, check and correct your configuration, and restart the cluster.
Otherwise it will always pick up the corrupted state from ZooKeeper and will never restart.

Regards
Bhaskar

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <[hidden email]> wrote:
I've made a mistake (the earlier log was from another cluster). The full exception log is:


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 7 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
... 9 more
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)

 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "曾祥才"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Is it a local filesystem or Hadoop? If it's NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
It seems you configured a Hadoop state store but are giving it a NAS mount.

Regards
Bhaskar

 

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <[hidden email]> wrote:
/flink/checkpoints is an external persistent store (a NAS directory mounted on the JobManager).




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 2:29 PM
To: "曾祥才"<[hidden email]>
Cc: "user"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

The following are the mandatory conditions to run in HA:

a) You should have a persistent, common external store that the JobManager and TaskManagers write their state to.
b) You should have a persistent external store for ZooKeeper to store the JobGraph in.

ZooKeeper is referring to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the JobManager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.


Regards
Bhaskar
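The failure mode in this thread can be sketched in a few lines: ZooKeeper keeps only a small state handle (a pointer to a file under high-availability.storageDir), while the serialized JobGraph lives in the file store. If the file disappears while the ZooKeeper entry survives, recovery fails exactly as in the logs above. This is a toy model, not Flink code; the dict standing in for ZooKeeper and the file names are invented for illustration:

```python
import os
import tempfile

# Toy model: ZooKeeper stores only a pointer (state handle) to the
# serialized JobGraph; the actual bytes live under the HA storage dir.
zookeeper = {}  # znode path -> file path (the "state handle")

storage_dir = tempfile.mkdtemp()
job_id = "639170a9d710bacfd113ca66b2aacefa"

# Job submission: the JobManager writes the graph to the storage dir
# and ZooKeeper records the handle pointing at it.
handle = os.path.join(storage_dir, "submittedJobGraph0c6bcff01199")
with open(handle, "wb") as f:
    f.write(b"<serialized JobGraph>")
zookeeper["/flink/jobgraphs/" + job_id] = handle

# Faulty cleanup: the file is removed but the znode survives.
os.remove(handle)

# Recovery after a JobManager restart: the dangling handle breaks it.
def recover(znode):
    with open(zookeeper[znode], "rb") as f:  # raises FileNotFoundError
        return f.read()

try:
    recover("/flink/jobgraphs/" + job_id)
except FileNotFoundError as e:
    print("recovery failed:", e)
```

The same dangling-handle effect occurs if the JobManager writes to a local path that is not actually the shared persistent store: the next JobManager instance simply cannot see the file.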


Re: JobGraphs not cleaned up in HA mode

seuzxc
Hi,
Is there any difference (for me, using NAS is more convenient to test with currently)?
From the docs it seems HDFS, S3, NFS, etc. will all be fine.




Re: JobGraphs not cleaned up in HA mode

Vijay Bhaskar
One more thing: you configured
high-availability.cluster-id: /cluster-test
but it should be
high-availability.cluster-id: cluster-test
I don't think this is a major issue, but in case it helps, you can check it.
Can you also check one more thing: is checkpointing happening or not?
Are you able to see the chk-* folders under the checkpoint directory?

Regards
Bhaskar
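On the leading-slash question: Flink builds the per-cluster ZooKeeper namespace by joining high-availability.zookeeper.path.root with the cluster-id, normalizing slashes along the way, which would explain why /cluster-test still works. A simplified sketch of that composition (the real logic lives in Flink's ZooKeeperUtils; this particular normalization is an assumption, not the exact implementation):

```python
def generate_zookeeper_path(root: str, namespace: str) -> str:
    # Sketch: drop any trailing slash on the root, ensure the namespace
    # has exactly one leading slash, then concatenate.
    root = root.rstrip("/")
    if not namespace.startswith("/"):
        namespace = "/" + namespace
    return root + namespace

# With or without the leading slash, the resulting namespace is the same:
generate_zookeeper_path("/flink/risk-insight", "/cluster-test")
# → "/flink/risk-insight/cluster-test"
generate_zookeeper_path("/flink/risk-insight", "cluster-test")
# → "/flink/risk-insight/cluster-test"
```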


Re: JobGraphs not cleaned up in HA mode

seuzxc
The chk-* directory is not found. I think it is missing because the JobManager removes it automatically, but why is it still in ZooKeeper?

 


------------------ 原始邮件 ------------------
发件人: "Vijay Bhaskar"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 晚上8:31
收件人: "曾祥才"<[hidden email]>;
抄送: "vino yang"<[hidden email]>;"User-Flink"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

One more thing:
You configured:
high-availability.cluster-id: /cluster-test
it should be:
high-availability.cluster-id: cluster-test
I don't think this is major issue, in case it helps, you can check.
Can you check one more thing:
Is check pointing happening or not? 
Were you able to see the chk-* folder under checkpoint directory?

Regards
Bhaskar

On Thu, Nov 28, 2019 at 5:00 PM 曾祥才 <[hidden email]> wrote:
hi,
Is there any deference (for me using nas is more convenient to test currently)?   
from the docs seems hdfs ,s3, nfs etc all will be fine.



------------------ 原始邮件 ------------------
发件人: "vino yang"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 晚上7:17
收件人: "曾祥才"<[hidden email]>;
抄送: "Vijay Bhaskar"<[hidden email]>;"User-Flink"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

Hi,

Why do you not use HDFS directly?

Best,
Vino

曾祥才 <[hidden email]> 于2019年11月28日周四 下午6:48写道:

anyone have the same problem? pls help, thks



------------------ 原始邮件 ------------------
发件人: "曾祥才"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 下午2:46
收件人: "Vijay Bhaskar"<[hidden email]>;
抄送: "User-Flink"<[hidden email]>;
主题: 回复: JobGraphs not cleaned up in HA mode

the config  (/flink is the NASdirectory ): 

jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history
 


------------------ 原始邮件 ------------------
发件人: "Vijay Bhaskar"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 下午3:12
收件人: "曾祥才"<[hidden email]>;
抄送: "User-Flink"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

Can you share the flink configuration once?

Regards
Bhaskar

On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <[hidden email]> wrote:
if i clean the zookeeper data , it runs fine .  but next time when the jobmanager failed and redeploy the error occurs again




------------------ 原始邮件 ------------------
发件人: "Vijay Bhaskar"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 下午3:05
收件人: "曾祥才"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 "
 Check why its unable to find.
Better thing is: Clean up zookeeper state and check your configurations, correct them and restart cluster.
Otherwise it always picks up corrupted state from zookeeper and it will never restart

Regards
Bhaskar

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <[hidden email]> wrote:
i've made a misstake( the log before is another cluster) . the full exception log is :


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 7 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
... 9 more
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)

 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Sent: Thursday, November 28, 2019, 2:46 PM
To: "曾祥才"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Is it a local filesystem or Hadoop? If it's NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
It seems you configured an HDFS state store but are pointing it at a NAS mount.

Regards
Bhaskar

 

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <[hidden email]> wrote:
/flink/checkpoints is an external persistent store (a NAS directory mounted on the jobmanager).




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Sent: Thursday, November 28, 2019, 2:29 PM
To: "曾祥才"<[hidden email]>
Cc: "user"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

The following are the mandatory conditions for running in HA mode:

a) You should have a persistent, common external store that the jobmanager and taskmanagers write their state to.
b) You should have a persistent external store for ZooKeeper to store the JobGraph.

ZooKeeper is referring to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the jobmanager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.


Regards
Bhaskar
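
[Editor's note] To make the conditions above concrete, a flink-conf.yaml for ZooKeeper HA might look like the sketch below. The hostnames and paths are placeholders, not taken from the reporter's setup; the key point is that high-availability.storageDir must be a durable store reachable from every jobmanager and taskmanager.

```yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.zookeeper.path.root: /flink
# Must be durable and shared (HDFS, S3, or a NAS mount that resolves to
# the same data on every node) -- ZooKeeper only stores a pointer to the
# JobGraph; the actual bytes live here:
high-availability.storageDir: hdfs:///flink/ha/
state.checkpoints.dir: hdfs:///flink/checkpoints
```

With this layout, a restarted jobmanager recovers a job by reading the state handle path out of ZooKeeper and then loading the JobGraph file from storageDir, which is exactly the step that fails in the logs in this thread when that directory is local or has been wiped.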

On Thu, Nov 28, 2019 at 10:43 AM seuzxc <[hidden email]> wrote:
hi ,I've the same problem with flink 1.9.1 , any solution to fix it
when the k8s redoploy jobmanager ,  the error looks like (seems zk not
remove submitted job info, but jobmanager remove the file): 


Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)


--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/