Question on checkpoint management

Cliff Resnick
When a job's cancel-with-savepoint completes a successful savepoint, the last successful checkpoint before it is removed. Is this the intended behavior? I thought that checkpoints and savepoints were separate entities and that savepoints therefore should not interfere with checkpoints. This is an issue for us because we have seen false-positive successful savepoints, perhaps due to S3 latency. Bottom line: we'd like to treat savepoints as insurance rather than as part of the critical path, and would rather they be oblivious to checkpoint management.

We are using externalized checkpoints, which may be confusing things. Also I know checkpoint management is undergoing some changes in Flink 1.3 (we are on Flink 1.2.0). Any insight is greatly appreciated.


Re: Question on checkpoint management

Stefan Richter
I think this JIRA issue is relevant to your question: https://issues.apache.org/jira/browse/FLINK-6328

In reply to Cliff Resnick <[hidden email]>, 08.05.2017, 19:33.
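
For readers following the thread, here is a toy model of the behavior described in the question. It is only a sketch and not Flink's actual implementation: it assumes a store that retains a fixed number of completed snapshots, and that the checkpoint disappears because the savepoint counts toward that limit. The class `RetainedStore`, the `count_savepoints` flag, and the snapshot names are all hypothetical and exist only for illustration.

```python
from collections import deque


class RetainedStore:
    """Toy model (hypothetical, not a Flink class) of a bounded snapshot store."""

    def __init__(self, max_retained=1, count_savepoints=True):
        self.max_retained = max_retained
        # Assumption: this flag decides whether savepoints share the
        # retention limit with regular checkpoints.
        self.count_savepoints = count_savepoints
        self.retained = deque()  # snapshots subject to the retention limit
        self.savepoints = []     # savepoints tracked outside the limit
        self.discarded = []      # snapshots evicted by the retention limit

    def add(self, name, is_savepoint=False):
        if is_savepoint and not self.count_savepoints:
            # Savepoint is kept separately and never evicts a checkpoint.
            self.savepoints.append(name)
            return
        self.retained.append(name)
        # Evict the oldest snapshots once the limit is exceeded.
        while len(self.retained) > self.max_retained:
            self.discarded.append(self.retained.popleft())


def run(count_savepoints):
    """Two checkpoints followed by one savepoint, with a retention limit of 1."""
    store = RetainedStore(max_retained=1, count_savepoints=count_savepoints)
    store.add("chk-1")
    store.add("chk-2")
    store.add("savepoint-1", is_savepoint=True)
    return store
```

With `run(True)` the savepoint evicts `chk-2`, which matches the symptom reported above. With `run(False)` the last checkpoint survives alongside the savepoint, which is the "savepoints as insurance" behavior the question asks for.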