(DEPRECATED) Apache Flink User Mailing List archive.

Failed job reinitiated with wrong checkpoint after a ZK reconnection

Posted by Paul Lam on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Failed-job-reinitiated-with-wrong-checkpoint-after-a-ZK-reconnection-tp38895.html

Hi,

We have a Flink 1.11.0 job running on YARN that reached the FAILED state because its JobManager lost leadership during a ZK full GC. After the ZK connection recovered, the job was somehow reinitiated with no checkpoints found in ZK, so it was restored from an earlier savepoint, which rewound the job unexpectedly.

I’ve filed an issue[1], and any comments are appreciated.

[1] https://issues.apache.org/jira/browse/FLINK-19778

Best,

Paul Lam