Hello,

I’ve reported issues around checkpoint recovery after job failures due to zookeeper connection loss in the past, and I am still seeing issues occasionally. This is for Flink 1.12.3 with zookeeper for HA, S3 as the state backend, incremental checkpoints, and task-local recovery enabled.

Here’s what happened: A zookeeper instance was terminated as part of a deployment for our zookeeper service. This caused a new jobmanager leader election (so far so good). A leader was elected and the job was restarted from the latest checkpoint but never became healthy. The root exception and the logs show issues reading state:

o.r.RocksDBException: Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003579.sst. Size recorded in manifest 36718, actual size 2570
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003573.sst. Size recorded in manifest 13756, actual size 1307
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003575.sst. Size recorded in manifest 16278, actual size 1138
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003576.sst. Size recorded in manifest 23108, actual size 1267
Sst file size mismatch: /mnt/data/tmp/flink-io-7139fea9-2dd8-42e6-8ffb-4d1a826f77d6/job_993eca72823b5ac13a377d7a844ac1b5_op_KeyedCoProcessOperator_d80b7e861bf73bdf93b8b27e5881807f__10_44__uuid_d3c2d251-c046-494a-bc25-57985a01fda1/db/003577.sst. Size recorded in manifest 148089, actual size 1293

	at org.rocksdb.RocksDB.open(RocksDB.java)
	at org.rocksdb.RocksDB.open(RocksDB.java:286)
	at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)
	... 22 common frames omitted
Wrapped by: java.io.IOException: Error while opening RocksDB instance.
	at o.a.f.c.s.s.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)
	at o.a.f.c.s.s.r.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:145)
	at o.a.f.c.s.s.r.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOper...

Since we retain multiple checkpoints, I tried redeploying the job from all checkpoints that were still available. All those attempts led to similar failures. (I eventually had to use an older savepoint to recover the job.)

Any guidance for avoiding this would be appreciated.

Peter
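(For reference, a setup like the one described above would typically be configured roughly as follows in flink-conf.yaml. This is a sketch only; the quorum addresses, bucket name, and HA/checkpoint paths are placeholders, not values from this deployment:)

# High availability via zookeeper
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.storageDir: s3://my-bucket/flink/ha

# RocksDB state backend with incremental checkpoints on S3
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.backend.incremental: true

# Task-local recovery, with TaskManager temp directories on the local mount
state.backend.local-recovery: true
io.tmp.dirs: /mnt/data/tmp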
Hi Peter,
Do you experience this issue when running without local recovery or without incremental checkpoints enabled? Or have you maybe compared the local (on TM) and remote (on DFS) SST files?

Regards,
Roman
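(A minimal sketch of the two configuration toggles this suggestion refers to, assuming the job currently runs with both features enabled; flipping them one at a time should show which feature is involved:)

state.backend.local-recovery: false   # one test run without task-local recovery
state.backend.incremental: false      # one test run without incremental checkpoints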
Hi Roman,

I am not able to consistently reproduce this issue; it seems to only occur when the failover happens at the wrong time. I have disabled task-local recovery and will report back if we see this again. We need incremental checkpoints for our workload.

The SST files in the error are not the ones for task-local recovery; those would be in a different directory (we have configured io.tmp.dirs as /mnt/data/tmp).

Thanks,
Peter
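(For context on the directory distinction described above: as far as I understand the defaults, both the RocksDB working directory, i.e. the flink-io-* path in the error, and the task-local recovery state are placed under io.tmp.dirs unless overridden explicitly. The sketch below uses illustrative values, not values from this deployment:)

# RocksDB instance working directory; the flink-io-* directory in the error
# above lives here. Believed to default to io.tmp.dirs when not set.
state.backend.rocksdb.localdir: /mnt/data/tmp

# Root directory for task-local recovery state; also believed to default to
# io.tmp.dirs when not set, with snapshots kept in their own subdirectory
# tree, separate from the RocksDB working directories.
taskmanager.state.local.root-dirs: /mnt/data/tmp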
> I am not able to consistently reproduce this issue. It seems to only occur when the failover happens at the wrong time. I have disabled task local recovery and will report back if we see this again.

Thanks, please post any results here.

> The SST files are not the ones for task local recovery, those would be in a different directory (we have configured io.tmp.dirs as /mnt/data/tmp).

Those files on /mnt could still be checked against the ones in the checkpoint directories (on S3/DFS); the sizes should match.

I'm also curious why you place local recovery files on a remote FS (I assume /mnt/data/tmp is a remote FS or a persistent volume). Currently, if a TM is lost (e.g. the process dies), those files cannot be used and recovery will fall back to S3/DFS, so this probably incurs some IO/latency unnecessarily.

Regards,
Roman
/mnt/data is a local disk, so there shouldn’t be any additional latency. I’ll provide more information when/if this happens again.

Peter