Hello,

I’ve reported issues around checkpoint recovery after job failures due to zookeeper connection loss in the past, and I am still seeing issues occasionally. This is for Flink 1.12.3 with zookeeper for HA, S3 as the state backend, incremental checkpoints, and task-local recovery enabled.

Here’s what happened: A zookeeper instance was terminated as part of a deployment for our zookeeper service. This caused a new jobmanager leader election (so far so good). A leader was elected and the job was restarted from the latest checkpoint but never became healthy. The root exception and the logs show issues reading state:

o.r.RocksDBException: Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003579.sst. Size recorded in manifest 36718, actual size 2570
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003573.sst. Size recorded in manifest 13756, actual size 1307
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003575.sst. Size recorded in manifest 16278, actual size 1138
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003576.sst. Size recorded in manifest 23108, actual size 1267
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003577.sst. Size recorded in manifest 148089, actual size 1293

	at org.rocksdb.RocksDB.open(RocksDB.java)
	at org.rocksdb.RocksDB.open(RocksDB.java:286)
	at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)
	... 22 common frames omitted
Wrapped by: java.io.IOException: Error while opening RocksDB instance.
	at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)
	at o.a.f.c.s.s.r.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:145)
	at o.a.f.c.s.s.r.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOper...

Since we retain multiple checkpoints, I tried redeploying the job from all checkpoints that were still available. All those attempts led to similar failures. (I eventually had to use an older savepoint to recover the job.)

Any guidance for avoiding this would be appreciated.

Peter
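(For reference, a setup like the one described above would typically be configured roughly as follows in flink-conf.yaml. This is a sketch only; the quorum addresses, bucket name, and HA/checkpoint paths are placeholders, not values from this deployment:)

# High availability via zookeeper
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.storageDir: s3://my-bucket/flink/ha

# RocksDB state backend with incremental checkpoints on S3
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.backend.incremental: true

# Task-local recovery, with TaskManager temp directories on the local mount
state.backend.local-recovery: true
io.tmp.dirs: /mnt/data/tmp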
Hi Peter,
Do you experience this issue when running without local recovery or without incremental checkpoints enabled? Or have you maybe compared the local (on TM) and remote (on DFS) SST files?

Regards,
Roman
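(A minimal sketch of the two configuration toggles this suggestion refers to, assuming the job currently runs with both features enabled; flipping them one at a time should show which feature is involved:)

state.backend.local-recovery: false   # one test run without task-local recovery
state.backend.incremental: false      # one test run without incremental checkpoints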
Hi Roman,

I am not able to consistently reproduce this issue; it seems to only occur when the failover happens at the wrong time. I have disabled task-local recovery and will report back if we see this again. We need incremental checkpoints for our workload.

The SST files in the error are not the ones for task-local recovery; those would be in a different directory (we have configured io.tmp.dirs as /mnt/data/tmp).

Thanks,
Peter
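(For context on the directory distinction described above: as far as I understand the defaults, both the RocksDB working directory, i.e. the flink-io-* path in the error, and the task-local recovery state are placed under io.tmp.dirs unless overridden explicitly. The sketch below uses illustrative values, not values from this deployment:)

# RocksDB instance working directory; the flink-io-* directory in the error
# above lives here. Believed to default to io.tmp.dirs when not set.
state.backend.rocksdb.localdir: /mnt/data/tmp

# Root directory for task-local recovery state; also believed to default to
# io.tmp.dirs when not set, with snapshots kept in their own subdirectory
# tree, separate from the RocksDB working directories.
taskmanager.state.local.root-dirs: /mnt/data/tmp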
> I am not able to consistently reproduce this issue. It seems to only occur when the failover happens at the wrong time. I have disabled task local recovery and will report back if we see this again.

Thanks, please post any results here.

> The SST files are not the ones for task local recovery, those would be in a different directory (we have configured io.tmp.dirs as /mnt/data/tmp).

Those files on /mnt could still be checked against the ones in the checkpoint directories (on S3/DFS); the sizes should match.

I'm also curious why you place local recovery files on a remote FS (I assume /mnt/data/tmp is a remote FS or a persistent volume). Currently, if a TM is lost (e.g. the process dies), those files cannot be used and recovery will fall back to S3/DFS, so this probably incurs some IO/latency unnecessarily.

Regards,
Roman
/mnt/data is a local disk, so there shouldn’t be any additional latency. I’ll provide more information when/if this happens again.

Peter