Re: JobGraphs not cleaned up in HA mode

Re: JobGraphs not cleaned up in HA mode

seuzxc
If I clean the ZooKeeper data, it runs fine. But the next time the JobManager fails and is redeployed, the error occurs again.




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>;
Sent: Thursday, November 28, 2019, 3:05 PM
To: "曾祥才"<[hidden email]>;
Subject: Re: JobGraphs not cleaned up in HA mode

Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199"
Check why it is unable to find it.
The better approach is: clean up the ZooKeeper state, check your configuration, correct it, and restart the cluster.
Otherwise it will always pick up the corrupted state from ZooKeeper and never restart.

Regards
Bhaskar
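Bhaskar's suggestion boils down to deleting the stale job-graph pointers from ZooKeeper before restarting. As a rough illustration (the znode layout sketched here assumes Flink's usual default of `<path.root>/<cluster-id>/jobgraphs`; the values come from the configuration posted later in this thread, so verify both against your Flink version), the znode to clear can be derived from the HA settings:

```python
# Illustrative only: compute the ZooKeeper znode that holds the submitted
# job graphs, given Flink HA settings. Assumes the default layout
# <path.root>/<cluster-id>/jobgraphs; verify against your Flink version.

def jobgraph_znode(conf):
    root = "/" + conf["high-availability.zookeeper.path.root"].strip("/")
    cluster = "/" + conf["high-availability.cluster-id"].strip("/")
    return root + cluster + "/jobgraphs"

conf = {
    "high-availability.zookeeper.path.root": "/flink/risk-insight",
    "high-availability.cluster-id": "/cluster-test",
}

znode = jobgraph_znode(conf)
print(f"zkCli.sh: rmr {znode}")  # -> zkCli.sh: rmr /flink/risk-insight/cluster-test/jobgraphs
```

Alongside clearing the znode, the matching files under high-availability.storageDir (/flink/ha here) would normally be removed as well, so that no orphaned state handles remain.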

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <[hidden email]> wrote:
I made a mistake (the log above was from another cluster). The full exception log is:


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs. 
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager. 
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper. 
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint. 
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb. 
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915) 
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) 
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) 
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) 
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594) 
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) 
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) 
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199) 
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75) 
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) 
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) 
... 7 more 
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. 
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190) 
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755) 
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740) 
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721) 
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888) 
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73) 
... 9 more 
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory) 
at java.io.FileInputStream.open0(Native Method) 
at java.io.FileInputStream.open(FileInputStream.java:195) 

 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>;
Sent: Thursday, November 28, 2019, 2:46 PM
To: "曾祥才"<[hidden email]>;
Cc: "User-Flink"<[hidden email]>;
Subject: Re: JobGraphs not cleaned up in HA mode

Is it filesystem or Hadoop? If it's NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
It seems you configured a Hadoop state store but are pointing it at a NAS mount.

Regards
Bhaskar

 

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <[hidden email]> wrote:
/flink/checkpoints is an external persistent store (a NAS directory mounted on the JobManager).




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>;
Sent: Thursday, November 28, 2019, 2:29 PM
To: "曾祥才"<[hidden email]>;
Cc: "user"<[hidden email]>;
Subject: Re: JobGraphs not cleaned up in HA mode

The following are the mandatory conditions for running in HA mode:

a) You should have a persistent common external store that the JobManager and TaskManagers can write state to.
b) You should have a persistent external store for ZooKeeper to store the JobGraph.

ZooKeeper refers to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the JobManager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.


Regards
Bhaskar
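The two-level layout Bhaskar describes is why a lost storage directory is fatal even though ZooKeeper itself is healthy: ZooKeeper keeps only a small state handle (a pointer), while the JobGraph bytes live in the HA storage directory. A minimal simulation of that failure mode (plain Python, not Flink code; the names are made up for illustration):

```python
# Minimal simulation (not Flink code) of the two-level job-graph store:
# ZooKeeper holds a pointer (state handle), the actual JobGraph bytes
# live in a file on shared storage.

class FileNotFound(Exception):
    pass

zookeeper = {}   # znode path -> file path (the "state handle")
storage = {}     # file path -> job graph bytes (the shared store)

def submit(job_id, graph):
    path = f"/flink/ha/submittedJobGraph-{job_id}"
    storage[path] = graph
    zookeeper[f"/jobgraphs/{job_id}"] = path

def recover(job_id):
    handle = zookeeper[f"/jobgraphs/{job_id}"]  # pointer survives in ZK
    if handle not in storage:                   # but the file may be gone
        raise FileNotFound(handle)
    return storage[handle]

submit("639170a9", b"graph-bytes")
storage.clear()   # e.g. JobManager pod redeployed with a fresh, non-shared dir
try:
    recover("639170a9")
except FileNotFound as e:
    print("recovery failed:", e)  # recovery failed: /flink/ha/submittedJobGraph-639170a9
```

This mirrors the thread's symptom exactly: cleaning ZooKeeper removes the dangling pointer, so the cluster starts, but if the storage directory does not survive the next redeploy, the same mismatch reappears.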

On Thu, Nov 28, 2019 at 10:43 AM seuzxc <[hidden email]> wrote:
Hi, I have the same problem with Flink 1.9.1; is there any solution to fix it?
When k8s redeploys the JobManager, the error looks like the following (it seems ZooKeeper does not
remove the submitted job info, but the JobManager removes the file):


Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: JobGraphs not cleaned up in HA mode

Vijay Bhaskar
Can you share the flink configuration once?

Regards
Bhaskar


Re: JobGraphs not cleaned up in HA mode

seuzxc
The config (/flink is the NAS directory):

jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history
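One thing worth checking in a setup like this is that every directory the HA machinery needs sits on the shared NAS mount, since whichever JobManager takes over leadership must see the same files. A small sketch of that sanity check (illustrative only; /flink as the mount point is taken from the config above):

```python
# Sketch: sanity-check that the HA/state directories in the config above all
# sit under the shared mount (/flink here), since every JobManager that may
# take over leadership must see the same files. Illustrative, not a Flink tool.

from urllib.parse import urlparse

conf = {
    "high-availability.storageDir": "/flink/ha",
    "state.checkpoints.dir": "file:///flink/checkpoints",
    "state.savepoints.dir": "file:///flink/savepoints",
    "jobmanager.archive.fs.dir": "file:///flink/archive/history",
}

MOUNT = "/flink"

def local_path(value):
    # Accept both plain paths and file:// URIs; anything else (hdfs://, s3://)
    # is not a local-filesystem path at all.
    parsed = urlparse(value)
    return parsed.path if parsed.scheme in ("", "file") else None

for key, value in conf.items():
    path = local_path(value)
    ok = path is not None and path.startswith(MOUNT + "/")
    print(f"{key}: {'ok' if ok else 'NOT on shared mount'}")
```

Note that high-availability.zookeeper.path.root and high-availability.zookeeper.path.checkpoints are znode paths inside ZooKeeper, not filesystem paths, so they are deliberately left out of the check.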
 



Re: JobGraphs not cleaned up in HA mode

seuzxc

Anyone have the same problem? Please help, thanks.




Re: JobGraphs not cleaned up in HA mode

vino yang
Hi,

Why do you not use HDFS directly?

Best,
Vino

曾祥才 <[hidden email]> 于2019年11月28日周四 下午6:48写道:

anyone have the same problem? pls help, thks



------------------ Original Message ------------------
From: "曾祥才"<[hidden email]>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "Vijay Bhaskar"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

The config (/flink is the NAS directory):

jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history
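For comparison, here is a sketch of how the HA-related entries could be written with explicit file:// schemes, so the HA storage directory is resolved unambiguously on the shared NAS mount rather than through the default filesystem. This is illustrative only, assuming /flink is mounted at the same path on every JobManager and TaskManager node:

```yaml
# Sketch only: HA paths spelled as explicit file:// URIs on the shared
# NAS mount (assumes /flink is mounted identically on every node).
high-availability: zookeeper
high-availability.cluster-id: cluster-test
high-availability.storageDir: file:///flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.zookeeper.path.root: /flink/risk-insight
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
```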
 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 3:12 PM
To: "曾祥才"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Can you share the flink configuration once?

Regards
Bhaskar

On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <[hidden email]> wrote:
If I clean the ZooKeeper data, it runs fine. But the next time the JobManager fails and is redeployed, the error occurs again.




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 3:05 PM
To: "曾祥才"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Again, it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199"
Check why it is unable to find it.
The better approach is to clean up the ZooKeeper state, check and correct your configuration, and restart the cluster.
Otherwise it will always pick up the corrupted state from ZooKeeper and will never restart.

Regards
Bhaskar

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <[hidden email]> wrote:
I've made a mistake (the earlier log was from another cluster). The full exception log is:


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 7 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
... 9 more
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)

 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 2:46 PM
To: "曾祥才"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Is it a local filesystem or Hadoop? If it's NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
It seems you configured a Hadoop state store but are giving it a NAS mount.

Regards
Bhaskar

 

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <[hidden email]> wrote:
/flink/checkpoints is an external persistent store (a NAS directory mounted on the JobManager).




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Date: Thursday, Nov 28, 2019, 2:29 PM
To: "曾祥才"<[hidden email]>
Cc: "user"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

The following are the mandatory conditions to run in HA:

a) You should have a persistent, common external store that the JobManager and TaskManagers write their state to.
b) You should have a persistent external store for ZooKeeper to store the JobGraph in.

ZooKeeper is referring to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the JobManager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.


Regards
Bhaskar
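The failure mode in this thread can be sketched in a few lines: ZooKeeper keeps only a small state handle (a pointer to a file under high-availability.storageDir), while the serialized JobGraph lives in the file store. If the file disappears while the ZooKeeper entry survives, recovery fails exactly as in the logs above. This is a toy model, not Flink code; the dict standing in for ZooKeeper and the file names are invented for illustration:

```python
import os
import tempfile

# Toy model: ZooKeeper stores only a pointer (state handle) to the
# serialized JobGraph; the actual bytes live under the HA storage dir.
zookeeper = {}  # znode path -> file path (the "state handle")

storage_dir = tempfile.mkdtemp()
job_id = "639170a9d710bacfd113ca66b2aacefa"

# Job submission: the JobManager writes the graph to the storage dir
# and ZooKeeper records the handle pointing at it.
handle = os.path.join(storage_dir, "submittedJobGraph0c6bcff01199")
with open(handle, "wb") as f:
    f.write(b"<serialized JobGraph>")
zookeeper["/flink/jobgraphs/" + job_id] = handle

# Faulty cleanup: the file is removed but the znode survives.
os.remove(handle)

# Recovery after a JobManager restart: the dangling handle breaks it.
def recover(znode):
    with open(zookeeper[znode], "rb") as f:  # raises FileNotFoundError
        return f.read()

try:
    recover("/flink/jobgraphs/" + job_id)
except FileNotFoundError as e:
    print("recovery failed:", e)
```

The same dangling-handle effect occurs if the JobManager writes to a local path that is not actually the shared persistent store: the next JobManager instance simply cannot see the file.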


Re: JobGraphs not cleaned up in HA mode

seuzxc
Hi,
Is there any difference (for me, using NAS is more convenient to test with currently)?
From the docs it seems HDFS, S3, NFS, etc. will all be fine.




Re: JobGraphs not cleaned up in HA mode

Vijay Bhaskar
One more thing: you configured
high-availability.cluster-id: /cluster-test
but it should be
high-availability.cluster-id: cluster-test
I don't think this is a major issue, but in case it helps, you can check it.
Can you also check one more thing: is checkpointing happening or not?
Are you able to see the chk-* folders under the checkpoint directory?

Regards
Bhaskar
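On the leading-slash question: Flink builds the per-cluster ZooKeeper namespace by joining high-availability.zookeeper.path.root with the cluster-id, normalizing slashes along the way, which would explain why /cluster-test still works. A simplified sketch of that composition (the real logic lives in Flink's ZooKeeperUtils; this particular normalization is an assumption, not the exact implementation):

```python
def generate_zookeeper_path(root: str, namespace: str) -> str:
    # Sketch: drop any trailing slash on the root, ensure the namespace
    # has exactly one leading slash, then concatenate.
    root = root.rstrip("/")
    if not namespace.startswith("/"):
        namespace = "/" + namespace
    return root + namespace

# With or without the leading slash, the resulting namespace is the same:
generate_zookeeper_path("/flink/risk-insight", "/cluster-test")
# → "/flink/risk-insight/cluster-test"
generate_zookeeper_path("/flink/risk-insight", "cluster-test")
# → "/flink/risk-insight/cluster-test"
```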


Re: JobGraphs not cleaned up in HA mode

seuzxc
The chk-* directory is not found. I think it is missing because the JobManager removes it automatically, but why is it still in ZooKeeper?

 


------------------ 原始邮件 ------------------
发件人: "Vijay Bhaskar"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 晚上8:31
收件人: "曾祥才"<[hidden email]>;
抄送: "vino yang"<[hidden email]>;"User-Flink"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

One more thing:
You configured:
high-availability.cluster-id: /cluster-test
it should be:
high-availability.cluster-id: cluster-test
I don't think this is major issue, in case it helps, you can check.
Can you check one more thing:
Is check pointing happening or not? 
Were you able to see the chk-* folder under checkpoint directory?

Regards
Bhaskar

On Thu, Nov 28, 2019 at 5:00 PM 曾祥才 <[hidden email]> wrote:
hi,
Is there any deference (for me using nas is more convenient to test currently)?   
from the docs seems hdfs ,s3, nfs etc all will be fine.



------------------ 原始邮件 ------------------
发件人: "vino yang"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 晚上7:17
收件人: "曾祥才"<[hidden email]>;
抄送: "Vijay Bhaskar"<[hidden email]>;"User-Flink"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

Hi,

Why do you not use HDFS directly?

Best,
Vino

曾祥才 <[hidden email]> 于2019年11月28日周四 下午6:48写道:

anyone have the same problem? pls help, thks



------------------ 原始邮件 ------------------
发件人: "曾祥才"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 下午2:46
收件人: "Vijay Bhaskar"<[hidden email]>;
抄送: "User-Flink"<[hidden email]>;
主题: 回复: JobGraphs not cleaned up in HA mode

the config  (/flink is the NASdirectory ): 

jobmanager.rpc.address: flink-jobmanager
taskmanager.numberOfTaskSlots: 16
web.upload.dir: /flink/webUpload
blob.server.port: 6124
jobmanager.rpc.port: 6123
taskmanager.rpc.port: 6122
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
high-availability: zookeeper
high-availability.cluster-id: /cluster-test
high-availability.storageDir: /flink/ha
high-availability.zookeeper.quorum: ****:2181
high-availability.jobmanager.port: 6123
high-availability.zookeeper.path.root: /flink/risk-insight
high-availability.zookeeper.path.checkpoints: /flink/zk-checkpoints
state.backend: filesystem
state.checkpoints.dir: file:///flink/checkpoints
state.savepoints.dir: file:///flink/savepoints
state.checkpoints.num-retained: 2
jobmanager.execution.failover-strategy: region
jobmanager.archive.fs.dir: file:///flink/archive/history
 


------------------ 原始邮件 ------------------
发件人: "Vijay Bhaskar"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 下午3:12
收件人: "曾祥才"<[hidden email]>;
抄送: "User-Flink"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

Can you share the flink configuration once?

Regards
Bhaskar

On Thu, Nov 28, 2019 at 12:09 PM 曾祥才 <[hidden email]> wrote:
if i clean the zookeeper data , it runs fine .  but next time when the jobmanager failed and redeploy the error occurs again




------------------ 原始邮件 ------------------
发件人: "Vijay Bhaskar"<[hidden email]>;
发送时间: 2019年11月28日(星期四) 下午3:05
收件人: "曾祥才"<[hidden email]>;
主题: Re: JobGraphs not cleaned up in HA mode

Again it could not find the state store file: "Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 "
 Check why its unable to find.
Better thing is: Clean up zookeeper state and check your configurations, correct them and restart cluster.
Otherwise it always picks up corrupted state from zookeeper and it will never restart

Regards
Bhaskar

On Thu, Nov 28, 2019 at 11:51 AM 曾祥才 <[hidden email]> wrote:
i've made a misstake( the log before is another cluster) . the full exception log is :


INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering all persisted jobs.
2019-11-28 02:33:12,726 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl  - Starting the SlotManager.
2019-11-28 02:33:12,743 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Released locks of job graph 639170a9d710bacfd113ca66b2aacefa from ZooKeeper.
2019-11-28 02:33:12,744 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take leadership with session id 607e1fe0-f1b4-4af0-af16-4137d779defb.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:594)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:75)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 7 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /639170a9d710bacfd113ca66b2aacefa. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:190)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:755)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:740)
at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:721)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$27(Dispatcher.java:888)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
... 9 more
Caused by: java.io.FileNotFoundException: /flink/ha/submittedJobGraph0c6bcff01199 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)

 


------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Sent: Thursday, November 28, 2019, 2:46 PM
To: "曾祥才"<[hidden email]>
Cc: "User-Flink"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

Is it a local filesystem or Hadoop? If it's NAS, then why the exception "Caused by: org.apache.hadoop.hdfs.BlockMissingException:"?
It seems you configured an HDFS state store but are pointing it at a NAS mount.

Regards
Bhaskar

 

On Thu, Nov 28, 2019 at 11:36 AM 曾祥才 <[hidden email]> wrote:
/flink/checkpoints is an external persistent store (a NAS directory mounted on the jobmanager).




------------------ Original Message ------------------
From: "Vijay Bhaskar"<[hidden email]>
Sent: Thursday, November 28, 2019, 2:29 PM
To: "曾祥才"<[hidden email]>
Cc: "user"<[hidden email]>
Subject: Re: JobGraphs not cleaned up in HA mode

The following are the mandatory conditions for running in HA mode:

a) You should have a persistent, common external store that the jobmanager and taskmanagers write their state to.
b) You should have a persistent external store for ZooKeeper to store the JobGraph.

ZooKeeper is referring to the path /flink/checkpoints/submittedJobGraph480ddf9572ed to get the job graph, but the jobmanager is unable to find it.
It seems /flink/checkpoints is not an external persistent store.


Regards
Bhaskar
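
[Editor's note] To make the conditions above concrete, a flink-conf.yaml for ZooKeeper HA might look like the sketch below. The hostnames and paths are placeholders, not taken from the reporter's setup; the key point is that high-availability.storageDir must be a durable store reachable from every jobmanager and taskmanager.

```yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.zookeeper.path.root: /flink
# Must be durable and shared (HDFS, S3, or a NAS mount that resolves to
# the same data on every node) -- ZooKeeper only stores a pointer to the
# JobGraph; the actual bytes live here:
high-availability.storageDir: hdfs:///flink/ha/
state.checkpoints.dir: hdfs:///flink/checkpoints
```

With this layout, a restarted jobmanager recovers a job by reading the state handle path out of ZooKeeper and then loading the JobGraph file from storageDir, which is exactly the step that fails in the logs in this thread when that directory is local or has been wiped.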

On Thu, Nov 28, 2019 at 10:43 AM seuzxc <[hidden email]> wrote:
hi ,I've the same problem with flink 1.9.1 , any solution to fix it
when the k8s redoploy jobmanager ,  the error looks like (seems zk not
remove submitted job info, but jobmanager remove the file): 


Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /147dd022ec91f7381ad4ca3d290387e9. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:662)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:821)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:72)
        ... 9 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1651346363-10.20.1.81-1525354906737:blk_1083182315_9441494 file=/flink/checkpoints/submittedJobGraph480ddf9572ed
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1052)


--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/