Hi Josh!

There are two ways to improve the RocksDB / S3 behavior:

(1) Use the FullyAsync mode. It stores the data in one file, not in a directory. Since directories are the "eventually consistent" part of S3, this prevents many issues.

(2) Flink 1.2-SNAPSHOT has some additional fixes that circumvent additional S3 issues.

Hope that helps,
Stephan

On Tue, Oct 11, 2016 at 4:42 PM, Josh <[hidden email]> wrote:

Hi Aljoscha,

Yeah, I'm using S3. Is this a known problem when using S3? Do you have any ideas on how to restore my job from this state, or prevent it from happening again?

Thanks,
Josh

On Tue, Oct 11, 2016 at 1:58 PM, Aljoscha Krettek <[hidden email]> wrote:

Hi,

you are using S3 to store the checkpoints, right? It might be that you're running into a problem with S3 "directory listings" not being consistent.

Cheers,
Aljoscha

On Tue, 11 Oct 2016 at 12:40 Josh <[hidden email]> wrote:

Hi all,

I just have a couple of questions about checkpointing and restoring state from RocksDB.

1) In some cases, I find that it is impossible to restore a job from a checkpoint, due to an exception such as the one pasted below [*]. In this case, it appears that the last checkpoint is somehow corrupt. Does anyone know why this might happen?

2) When the above happens, I have no choice but to cancel the job, as it repeatedly attempts to restart and keeps getting the same exception. Given that no savepoint was taken recently, is it possible for me to restore the job from an older checkpoint (e.g. the second-last checkpoint)?

The version of Flink I'm using is Flink-1.1-SNAPSHOT, from mid-June.

Thanks,
Josh

[*] The exception when restoring state:

java.lang.Exception: Could not restore checkpointed state to operators and functions
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:480)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:219)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error while restoring RocksDB state from /mnt/yarn/usercache/hadoop/appcache/application_1476181294189_0001/flink-io-09ad1cb1-8dff-4f9a-9f61-6cae27ee6f1d/d236820a793043bd63360df6f175cae9/StreamFlatMap_9_8/dummy_state/dc5beab1-68fb-48b3-b3d6-272497d15a09/chk-1
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:537)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.injectKeyValueStateSnapshots(RocksDBStateBackend.java:489)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.restoreState(AbstractStreamOperator.java:204)
    at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:154)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:472)
    ... 3 more
Caused by: org.rocksdb.RocksDBException: NotFound: Backup not found
    at org.rocksdb.BackupEngine.restoreDbFromLatestBackup(Native Method)
    at org.rocksdb.BackupEngine.restoreDbFromLatestBackup(BackupEngine.java:177)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.restoreFromSemiAsyncSnapshot(RocksDBStateBackend.java:535)
    ... 7 more
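
For reference, suggestion (1) from Stephan's reply (the FullyAsync mode) might look roughly like the sketch below. It assumes the Flink 1.1 RocksDBStateBackend API, where fully asynchronous snapshots are switched on with enableFullyAsyncSnapshots(); the stack trace above shows the default semi-async path (restoreFromSemiAsyncSnapshot). The class name, job name, checkpoint interval, and s3:// URI are placeholders and do not come from this thread. Please check the method against the Javadoc of the Flink version you actually run, since later releases changed this part of the API.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job class; needs the Flink 1.1 streaming and
// flink-statebackend-rocksdb dependencies on the classpath.
public class FullyAsyncRocksDbJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder URI: replace with your own checkpoint location.
        RocksDBStateBackend backend =
                new RocksDBStateBackend("s3://your-bucket/flink/checkpoints");

        // Assumption: Flink 1.1 API. Snapshots are then written as a single
        // file instead of a backup directory, which avoids relying on
        // S3 directory listings during restore.
        backend.enableFullyAsyncSnapshots();

        env.setStateBackend(backend);

        // Placeholder interval: checkpoint every 60 seconds.
        env.enableCheckpointing(60000);

        // Trivial pipeline so the sketch is a complete job.
        env.fromElements(1, 2, 3).print();

        env.execute("fully-async-rocksdb-sketch");
    }
}
```

Note that the snapshot mode is part of the backend configuration at job start, so a job that is already stuck restoring a semi-async checkpoint would not be repaired by this change alone; it is meant to prevent the problem on future runs.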