State restoration from checkpoint

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

State restoration from checkpoint

Parth Sarathy
_metadata._metadataHi, Recently we upgraded our application in which flink windowed transformation is used. The earlier version of the application used flink 1.7.2 while the new version uses flink 1.8.2. While submitting new job the application sets path to the latest checkpoint directory as restore path. After upgrade we see below statement in job manager log which was as expected:

2019-10-30 13:04:46.895 [flink-akka.actor.default-dispatcher-12] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job b8370f434690bd82ae413183b0826567 from savepoint /data-processor/data/checkpoints/c7775bdc5d9f51f41576342946cf1037/chk-19089 (allowing non restored state)

But then the following error comes which prevents the new job to run:

2019-10-30 13:05:23.500 [flink-akka.actor.default-dispatcher-12] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - graph-updates-stream (1/8) (2d07dc9bf1ef6ef3e4b7fdbf32943e4e) switched from RUNNING to FAILED. java.lang.Exception: Exception while creating StreamOperatorStateContext. at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:740) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:291) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_212] Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for WindowOperator_6d6ae3c2fa6c08ed7dd466debdde6743_(1/8) from any of the 1 provided restore options. at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135) ~[flink-dist_2.11-1.8.2.jar:1.8.2] ... 5 common frames omitted Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception. at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:324) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142) ~[flink-dist_2.11-1.8.2.jar:1.8.2] at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121) ~[flink-dist_2.11-1.8.2.jar:1.8.2] ... 7 common frames omitted

Caused by: java.io.FileNotFoundException: /data-processor/data/checkpoints/6f8c503c32329e7ed1df3029ba1e2c74/shared/75d656bd-450e-42eb-ab40-edde993e3f12 (No such file or directory) at java.io.FileInputStream.open0(Native Method) ~[na:1.8.0_212]...
Why is it looking for files in some other job directory for restoring state? I have attached metadata file from /data-processor/data/checkpoints/c7775bdc5d9f51f41576342946cf1037/chk-19089. It refers to multiple directories in /data-processor/data/checkpoints/6f8c503c32329e7ed1df3029ba1e2c74/shared/.
Please elaborate on which all checkpoint directories/files are used for state restoration. Our requirement is to clean up all the job/checkpoint directories which are not required.
Thanks,
Parth Sarathy

Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.