Caused by: java.io.FileNotFoundException: Item not found: 'gs://../app/checkpoints/2834fa1c81dcf7c9578a8be9a371b0d1/shared/3477b236-fb4b-4a0d-be73-cb6fac62c007'. Note, it is possible that the live version is still available but the requested generation is deleted.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:45)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:653)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.open(GoogleCloudStorageFileSystem.java:277)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleHadoopFSInputStream.java:78)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.open(GoogleHadoopFileSystemBase.java:620)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
at com.css.flink.fs.gcs.moved.HadoopFileSystem.open(HadoopFileSystem.java:120)
at com.css.flink.fs.gcs.moved.HadoopFileSystem.open(HadoopFileSystem.java:37)
at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:127)
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:126)
at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:109)
at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:50)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
at java.util.concurrent.CompletableFuture.asyncRunStage(CompletableFuture.java:1654)
at java.util.concurrent.CompletableFuture.runAsync(CompletableFuture.java:1871)
at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForAllStateHandles(RocksDBStateDownloader.java:83)
at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.transferAllStateDataToDirectory(RocksDBStateDownloader.java:66)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.transferRemoteStateToLocalDirectory(RocksDBIncrementalRestoreOperation.java:230)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:195)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:169)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:155)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270)
... 15 more
We tried rolling back the code and we tried restoring from different checkpoints, but every attempt failed with the same error. The job ID in the error message does not match the checkpoint path we restored from, so it looks like the restore logic reaches back into previous jobs: every checkpoint taken after job 2834fa1c81dcf7c9578a8be9a371b0d1 fails to restore with this same error.
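
For reference, this is roughly how we checked whether the shared handle from the error still exists in the bucket. It is a minimal sketch: the bucket name is a placeholder for our redacted value, and it assumes the GCS connector jar is on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: check whether the shared state file named in the error
// is still present in GCS. "my-bucket" is a hypothetical placeholder for
// our redacted bucket name.
public class CheckSharedHandle {
    public static void main(String[] args) throws Exception {
        Path handle = new Path(
            "gs://my-bucket/app/checkpoints/2834fa1c81dcf7c9578a8be9a371b0d1"
                + "/shared/3477b236-fb4b-4a0d-be73-cb6fac62c007");
        FileSystem fs = handle.getFileSystem(new Configuration());
        System.out.println(handle + " exists: " + fs.exists(handle));
    }
}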
We also inspected several checkpoints and found that some of them are missing the _metadata file, so they cannot be used for restoration; a sketch of the check we ran follows.
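
This is roughly how we scanned for incomplete checkpoints. Again a sketch under assumptions: the checkpoints root is a placeholder, and it only flags directories that lack a _metadata file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list chk-* directories under a (hypothetical) checkpoints root
// and report which ones are missing the _metadata file that a restore needs.
public class FindIncompleteCheckpoints {
    public static void main(String[] args) throws Exception {
        Path root = new Path("gs://my-bucket/app/checkpoints/my-job-id");
        FileSystem fs = root.getFileSystem(new Configuration());
        for (FileStatus status : fs.listStatus(root)) {
            if (status.isDirectory()
                    && status.getPath().getName().startsWith("chk-")) {
                boolean hasMetadata =
                    fs.exists(new Path(status.getPath(), "_metadata"));
                System.out.println(status.getPath() + " -> "
                    + (hasMetadata ? "complete" : "MISSING _metadata"));
            }
        }
    }
}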
We also use ZooKeeper (ZK) for HA, and we cleaned up the state there between deployments to make sure the reference to the non-existent file was not coming from there.
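
To double-check the cleanup, we listed the checkpoint znodes that Flink's HA services keep in ZK. A sketch using Curator; the znode paths are assumptions matching the defaults (high-availability.zookeeper.path.root=/flink, cluster-id "default"), and the connect string is a placeholder.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch: list the checkpoint znodes under Flink's HA root in ZooKeeper to
// confirm no stale checkpoint pointer survived between deployments. Paths
// assume the default high-availability.zookeeper.path.root and cluster-id.
public class ListHaCheckpoints {
    public static void main(String[] args) throws Exception {
        try (CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3))) {
            zk.start();
            String path = "/flink/default/checkpoints";
            for (String child : zk.getChildren().forPath(path)) {
                System.out.println(path + "/" + child);
            }
        }
    }
}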