Hello,
I have some confusion about checkpoints vs. savepoints and how to use them effectively in my application. I am working on an application that relies on Flink's fault-tolerance mechanism to ensure exactly-once semantics. I have enabled external checkpointing in my application as below:

env.enableCheckpointing(CHECKPOINT_TIME_MS)

Please correct me in case I am wrong, but my understanding is that the above ensures that if the application crashes, it is able to recover from the last known location. This, however, won't work if we cancel the application (for new deployments/restarts).

Reading the documentation on savepoints suggests that it is good practice to take savepoints at regular intervals of time (by cron etc.) so that the application can be restarted from a last known location. It also points to using the command-line option (-s) to cancel an application, so that the application stops only after taking a savepoint. (The commands I am referring to are sketched at the end of this message.)

Based on the above understanding, I have some questions below:

Questions:
Please let me know.
Thanks,
Vipul
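For reference, the commands I am referring to (a sketch; the job id and savepoint directory are placeholders):

    # trigger a savepoint periodically, e.g. from a cron job
    flink savepoint <jobID> s3://my-bucket/flink/savepoints/

    # cancel the job with a savepoint before a redeploy
    flink cancel -s s3://my-bucket/flink/savepoints/ <jobID>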
Hi,

I have answered your questions inline:

Not the same, because checkpoints and savepoints differ in certain aspects, but both methods leave you with something that survives job cancellation and can be used to restart from a certain state.
You could see it that way, but I would describe savepoints more as user-defined *restart* points than *recovery* points. Please take a look at my answers in this thread, because they cover most of your questions:
I don't think that this feature exists; you have to specify the savepoint.
If you restart a canceled application, it will not consider checkpoints; they are only considered for recovery on failure. You need to specify a savepoint or an externalized checkpoint for restarts, to make explicit that you intend to restart a job and not to run a new instance of the job.
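For example, the restart is then made explicit on the command line (a sketch; the savepoint path and jar name are placeholders):

    flink run -s s3://my-bucket/flink/savepoints/savepoint-abc123 my-app.jar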
Best,
Stefan
Thanks Stefan for the answers above. These are really helpful. I have a few follow-up questions:
Please let me know.
Thanks,
Vipul
Hi,
Flink does not rely on file system operations to list contents; all necessary file paths are stored in the metadata file, as you guessed. This is the reason savepoints also work with file systems that "only" have read-after-write consistency.

Best,
Aljoscha
Thanks Aljoscha for the answer above. I am experimenting with savepoints and checkpoints on my end, so that we can build a fault-tolerant application with exactly-once semantics. I have been able to test various scenarios, but have doubts about one use case.

My app is running on an EMR cluster, and I am trying to test the case where the EMR cluster is terminated. I have read that state.checkpoints.dir is responsible for storing the metadata, which links to the data files under state.backend.fs.checkpointdir. For my application I have configured both state.backend.fs.checkpointdir and state.checkpoints.dir (a sketch of my configuration is at the end of this message). Also, I have the following in my main app:

env.enableCheckpointing(CHECKPOINT_TIME_MS)

In the application startup logs I can see the state.backend.fs.checkpointdir and state.checkpoints.dir values being loaded. However, when the checkpoint happens, I don't see any content in the metadata dir. Is there something I am missing? Please let me know. I am using Flink version 1.3.

Thanks,
Vipul
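For concreteness, the relevant part of my flink-conf.yaml looks roughly like this (a sketch; the bucket and path names are placeholders):

    state.backend: rocksdb
    # directory for the checkpoint data files
    state.backend.fs.checkpointdir: s3://my-bucket/flink/checkpoint-data
    # directory for the checkpoint metadata
    state.checkpoints.dir: s3://my-bucket/flink/checkpoint-metadata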
Hi,

Did you enable externalized checkpoints? [1]

Best,
Tony Wei
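For reference, env.enableCheckpointing(...) alone turns on periodic checkpoints but does not externalize them; that has to be enabled separately. Roughly (Flink 1.3, Java; a minimal sketch, not your exact code):

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // periodic checkpoints, as you already have
    env.enableCheckpointing(CHECKPOINT_TIME_MS);
    // additionally retain the checkpoint metadata when the job is cancelled
    env.getCheckpointConfig().enableExternalizedCheckpoints(
            CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);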
Thanks Tony, that was the issue. I was thinking that when we use RocksDB and provide an S3 path, it uses externalized checkpoints by default. Thanks so much!

I have one follow-up question. Say, in the above case, I terminate the cluster; since the metadata is on S3 and not on local storage, does Flink avoid the read-after-write consistency issue of S3? Is that a valid concern, or do we handle that case in externalized checkpoints as well, and not deal with file system listing operations while retrieving externalized checkpoints on S3?

Thanks,
Vipul
Hi,
That distinction with externalised checkpoints is a bit of a pitfall, and I'm hoping that we can actually get rid of it in the next version or the version after that. With that change, all checkpoints would always be externalised, since it's not really any noticeable overhead.

Regarding read-after-write consistency, you should be fine, since the "externalised checkpoint", i.e. the metadata, is only one file. If you know the file path (either from the Flink dashboard or by looking at the S3 bucket) you can restore from it.

Best,
Aljoscha
Thanks Aljoscha for the explanations. I was able to recover from the last externalized checkpoint by using:

flink run -s <metadata file> <options>

I am curious: at the moment, are there any options to save the metadata file name to some other place, like DynamoDB? The reason I am asking is that, for the launcher code we are writing, we want to ensure that if a Flink job crashes, we can just start it from the last known externalized checkpoint. In the present scenario, we have to list the contents of the S3 bucket that saves the metadata to find the last metadata file before the failure, and there might be a window where we run into the read-after-write consistency issue of S3. Thoughts?
As a follow-up to the above: is there a way to get the last checkpoint metadata location inside the notifyCheckpointComplete method? I tried poking around, but didn't see a way to achieve this (what I tried is sketched below my signature). Or is there any other way to save the actual checkpoint metadata location to a datastore (DynamoDB etc.)? We are looking to save the savepoint/externalized checkpoint metadata location in some storage space, so that we can pass this information to the flink run command during recovery, thereby removing the possibility of any read-after-write consistency issues arising from listing file paths.

Thanks,
Vipul
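Roughly what I tried (Java; MyS3Sink is a hypothetical operator of ours):

    import org.apache.flink.runtime.state.CheckpointListener;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class MyS3Sink extends RichSinkFunction<String> implements CheckpointListener {

        @Override
        public void invoke(String value) throws Exception {
            // write the record out (elided)
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) throws Exception {
            // only the numeric checkpoint id is available here; the callback
            // does not expose the path of the checkpoint metadata file
        }
    }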
Hi team,

I have a similar use case; do we have any answers on this? When we trigger a savepoint, can we store that information in ZK as well? That way I can avoid S3 file listing and do not have to use other external services.
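To make the idea concrete, here is roughly what I have in mind, assuming Apache Curator for the ZooKeeper access (launcher-side code, not a built-in Flink feature as far as I know; the connect string, znode path, and savepoint path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    import java.nio.charset.StandardCharsets;

    // after 'flink savepoint <jobID> <dir>' prints the savepoint path,
    // record that path in ZooKeeper so the launcher can find it without listing S3
    CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();
    byte[] path = "s3://my-bucket/flink/savepoints/savepoint-abc123"
            .getBytes(StandardCharsets.UTF_8);
    if (client.checkExists().forPath("/my-app/last-savepoint") == null) {
        client.create().creatingParentsIfNeeded().forPath("/my-app/last-savepoint", path);
    } else {
        client.setData().forPath("/my-app/last-savepoint", path);
    }
    client.close();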