Tooling for resuming from checkpoints

Dominik Bruhn
Hey,

we are running Flink 1.3.2 with streaming jobs, and we run into issues
whenever we restart a complete job (which can happen for various
reasons: upgrading the job, restarting the cluster, failures). The
problem is that there is no automated way to find out which checkpoint
metadata file (that is, which externalized checkpoint) we should resume
from. We can always end up with several of those files, in which case we
want to use the most recent one that was written successfully.

Is there already any tooling available that picks the latest good
checkpoint? Or at least a command-line tool that we can use to check
that a checkpoint is valid, so that we can pick the latest one ourselves?

How are others handling this? Manually?

Would be happy to get some input there,
Dominik

Re: Tooling for resuming from checkpoints

Timo Walther
Hi Dominik,

The Web UI shows you the status of each checkpoint [0], so it might be
possible to retrieve that information via REST calls. For planned
restarts, you should usually take a savepoint instead: if the savepoint
completes successfully, you can be sure that you can restart from it.
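
To make the REST idea more concrete, here is a minimal sketch in Python
of how the selection step could look. It only covers picking the newest
completed checkpoint out of an already fetched and parsed JSON response;
it does not fetch anything and does not verify the files on disk. The
response layout (a "history" list whose entries carry "id", "status" and
"external_path") is an assumption about what the checkpoint statistics
look like and should be checked against the REST output of your Flink
version. The function name and the sample paths are hypothetical.

```python
from typing import Any, Dict, Optional


def latest_completed_external_path(checkpoint_stats: Dict[str, Any]) -> Optional[str]:
    """Return the external path of the newest completed checkpoint, or None.

    ASSUMPTION: checkpoint_stats is the parsed JSON of the job's checkpoint
    statistics and has a "history" list whose entries contain "id",
    "status" and (for externalized checkpoints) "external_path".
    """
    history = checkpoint_stats.get("history", [])

    # Keep only checkpoints that finished and were actually externalized.
    completed = [
        entry for entry in history
        if entry.get("status") == "COMPLETED" and entry.get("external_path")
    ]
    if not completed:
        return None

    # Checkpoint IDs increase over time, so the highest ID is the newest one.
    newest = max(completed, key=lambda entry: entry["id"])
    return newest["external_path"]


# Hypothetical response body, shaped the way this sketch assumes.
sample_stats = {
    "history": [
        {"id": 4, "status": "IN_PROGRESS"},
        {"id": 3, "status": "FAILED"},
        {"id": 2, "status": "COMPLETED",
         "external_path": "hdfs:///flink/ext-checkpoints/checkpoint_metadata-2"},
        {"id": 1, "status": "COMPLETED",
         "external_path": "hdfs:///flink/ext-checkpoints/checkpoint_metadata-1"},
    ]
}

print(latest_completed_external_path(sample_stats))
```

Note that this only works while the job (or at least the JobManager that
ran it) is still reachable, so the path would have to be recorded
periodically if it is meant to help after a cluster failure.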

Otherwise, the platform from data Artisans [1] might be interesting for
you: it aims to improve deployment across the lifecycle of streaming
applications (disclaimer: I work for them).



On 11/22/17 at 10:41 AM, [hidden email] wrote:

> Hey,
> we are running Flink 1.3.2 with streaming jobs and we are running into
> issues when we are restarting a complete job (which can happen due to
> various reasons: upgrading of the job, restarting of the cluster,
> failures). The problem is that there is no automated way to find out
> from which checkpoint-metadata (so externalized checkpoint) we should
> resume. There can always be the situation that we are left with
> multiple of those files: Now you want to use the most recent one which
> is successfully written.
> Is there any tooling available already which picks the latest good
> checkpoint? Or at least a tool/commandline which we can use to
> validate that a checkpoint is valid so we can pick the latest one?
> How are others handling this? Manually?
> Would be happy to get some input there,
> Dominik