Hi, We are trying to save checkpoints for one of the flink job running in Flink version 1.9 and tried to resume the same flink job in Flink version 1.11.2. We are getting
the below error when trying to restore the saved checkpoint in the newer flink version. Can
Cannot map checkpoint/savepoint state for operator fbb4ef531e002f8fb3a2052db255adf5 to the new program, because the
operator is not available in the new program. Complete Stack Trace : {"errors":["org.apache.flink.runtime.rest.handler.RestHandlerException:
Could not execute application.\n\tat org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$1(JarRunHandler.java:103)\n\tat java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)\n\tat java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)\n\tat
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)\n\tat java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1609)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not execute application.\n\tat
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)\n\tat java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)\n\tat java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)\n\t...
7 more\nCaused by: org.apache.flink.util.FlinkRuntimeException: Could not execute application.\n\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\n\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\n\tat
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:100)\n\tat java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\n\t... 7 more\nCaused by: org.apache.flink.client.program.ProgramInvocationException:
The main method caused an error: Failed to execute job 'ST1_100Services-preprod-Tumbling-ProcessedBased'.\n\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\n\tat org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\n\tat
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\n\tat org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\n\t... 10 more\nCaused by: org.apache.flink.util.FlinkException:
Failed to execute job 'ST1_100Services-preprod-Tumbling-ProcessedBased'.\n\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1821)\n\tat org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\n\tat
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\n\tat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1697)\n\tat org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:699)\n\tat
com.man.ceon.cep.jobs.AnalyticService$.main(AnalyticService.scala:108)\n\tat com.man.ceon.cep.jobs.AnalyticService.main(AnalyticService.scala)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)\n\t... 13 more\nCaused
by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.\n\tat org.apache.flink.runtime.dispatcher.Dispatcher.lambda$internalSubmitJob$3(Dispatcher.java:344)\n\tat java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)\n\tat
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)\n\tat java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)\n\tat akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)\n\tat akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)\n\tat
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused
by: org.apache.flink.runtime.client.JobExecutionException: Could not instantiate JobManager.\n\tat org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:398)\n\tat java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\n\t...
6 more\nCaused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint s3://prodv2-flink-cluster/savepoints/savepoint-b76d18-d302cc7ca666.
Cannot map checkpoint/savepoint state for operator fbb4ef531e002f8fb3a2052db255adf5 to the new program, because the operator is not available in the new program. If you want to allow to skip this,
you can set the --allowNonRestoredState option on the CLI.\n\tat org.apache.flink.runtime.checkpoint.Checkpoints.throwNonRestoredStateException(Checkpoints.java:210)\n\tat org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:180)\n\tat
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1397)\n\tat org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:300)\n\tat org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:253)\n\tat
org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:229)\n\tat org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:119)\n\tat org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:103)\n\tat
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:284)\n\tat org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:272)\n\tat org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)\n\tat
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)\n\tat org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:140)\n\tat org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)\n\tat
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:388)\n\t... 7 more\n"]} |
Hi, Have you dropped or renamed any operator from the original job? If yes, and you are okay with discarding the state of that operator, you can submit the job with --allowNonRestoredState or -n. https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#allowing-non-restored-state - Sivaprasanna On Fri, Oct 23, 2020 at 10:48 AM Partha Mishra <[hidden email]> wrote:
|
Hi, None of the operator is renamed or removed. Testing is carried out with exactly same binary used with 1.9 and 1.11.2. Checkpoint saved
in 1.9 is not being able to retrieve in 1.11.2 From: Sivaprasanna <[hidden email]>
Hi, Have you dropped or renamed any operator from the original job? If yes, and you are okay with discarding the state of that operator, you can submit the job with --allowNonRestoredState or -n. https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#allowing-non-restored-state - Sivaprasanna On Fri, Oct 23, 2020 at 10:48 AM Partha Mishra <[hidden email]> wrote:
|
Hi Partha
The exception here said that there is some operator in the checkpoint/savepoint, but not in the new program. As you said that, both times use the same user program binary, this seems a little strange to me. did you ever set the uid for all the operators? Best, Congxian Partha Mishra <[hidden email]> 于2020年10月23日周五 下午3:02写道:
|
Hi Cong, No never set the uid for the operators. Regards, Partha Mishra From: Congxian Qiu <[hidden email]>
Hi Partha The exception here said that there is some operator in the checkpoint/savepoint, but not in the new program. As you said that, both times use the same user program binary, this seems a little strange to me. did you ever set the uid for all the operators? Best, Congxian Partha Mishra <[hidden email]>
于2020年10月23日周五
下午3:02写道:
|
Free forum by Nabble | Edit this page |