My Flink job runs in Kubernetes. This is the setup:
1. One job running as a job cluster with one job manager
2. HA powered by ZooKeeper (works fine)
3. Job/Deployment manifests stored in GitHub and deployed to Kubernetes by Argo
4. State persisted to S3

If I were to stop the job (drain and take a savepoint) and resume, I would have to update the job manager manifest with the savepoint location (see the sketch after this message), commit it to GitHub, and redeploy. After the deployment, I would presumably have to modify the manifest again to remove the savepoint location so as to avoid starting the application from the same savepoint later. This raises some questions:

1. If the job manager were to crash before the manifest is updated again, won't Kubernetes restart the job manager from the savepoint rather than the latest checkpoint?
2. Is there a way to ensure that restoration from a savepoint doesn't happen more than once, or at least not after the first successful checkpoint?
3. If even one checkpoint has been finalized, then the job should prefer the checkpoint rather than the savepoint. Will that happen automatically given ZooKeeper?
4. Is it possible to not have to remove the savepoint path from the Kubernetes manifest and simply rely on newer checkpoints/savepoints? It feels rather clumsy to have to add and remove it manually. We could use a cron job to remove it, but it's still clumsy.
5. Is there a way of asking Flink to use the latest savepoint rather than specifying the location of the savepoint? If I were to manually rename the S3 savepoint location to something fixed (s3://fixed_savepoint_path_always), would there be any problem restoring the job?
6. Is there any open source tool that solves this problem?
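For context, the manifest change being discussed is usually just the savepoint argument handed to the job-cluster entry point. Below is a minimal sketch of such a fragment; the image, job class, and S3 paths are placeholders, and the exact arguments depend on the Flink version and entry point in use (recent standalone job/application cluster entry points accept --fromSavepoint and --allowNonRestoredState):

# Sketch of the relevant container spec in a job-cluster manifest.
# All names and paths below are illustrative placeholders.
containers:
  - name: flink-jobmanager
    image: my-registry/my-flink-job:1.11.2
    args:
      - standalone-job
      - --job-classname
      - com.example.MyStreamingJob
      # These are the lines that have to be added for a restore
      # and removed again afterwards in the workflow described above:
      - --fromSavepoint
      - s3://my-bucket/savepoints/savepoint-abc123
      - --allowNonRestoredState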
Hi,

If you start a Flink job from a savepoint and the job then needs to recover, it will only reuse the savepoint if no later checkpoint has been created. Flink will always use the latest checkpoint/savepoint taken.

Cheers,
Till

On Wed, Dec 16, 2020 at 9:47 PM vishalovercome <[hidden email]> wrote:
Thanks for your reply!
What I have seen is that the job terminates when there's an intermittent loss of connectivity with ZooKeeper. This is in fact the most common reason why our jobs are terminating at this point. Worse, the job is unable to restore from a checkpoint during some (not all) of these terminations. Under these scenarios, won't the job try to recover from the savepoint?

I've gone through various tickets reporting stability issues due to ZooKeeper that you've mentioned you intend to resolve soon. But until the ZooKeeper-based HA is stable, should we assume that the job will repeatedly restore from savepoints? I would rather rely on Kafka offsets to resume where the job left off than on savepoints.
What exactly are the problems when the checkpoint recovery does not work? Even if the ZooKeeper connection is temporarily lost, which leads to the JobMaster losing leadership and the job being suspended, the next leader should continue from where the first job stopped because of the lost ZooKeeper connection.

What happens under the hood when restoring from a savepoint is that it is inserted into the CompletedCheckpointStore, where the other checkpoints are also stored. If a failure then happens, Flink will first try to recover from a checkpoint/savepoint in the CompletedCheckpointStore, and only if this store does not contain any checkpoints/savepoints will it use the savepoint with which the job was started. The CompletedCheckpointStore persists the checkpoint/savepoint information by writing pointers to ZooKeeper.

Cheers,
Till

On Mon, Dec 21, 2020 at 11:38 AM vishalovercome <[hidden email]> wrote:
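For readers following along, the store described above is what the standard ZooKeeper HA settings enable. A minimal illustrative flink-conf.yaml fragment is below; the quorum address, paths, and bucket are placeholders rather than values from this thread:

# Illustrative ZooKeeper HA configuration (placeholder hosts and paths).
# ZooKeeper only holds pointers to completed checkpoints/savepoints;
# the checkpoint metadata itself is written to the HA storage directory.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /my-job-cluster
high-availability.storageDir: s3://my-bucket/flink/ha/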
I don't know how to reproduce it, but what I've observed are three kinds of termination when connectivity with ZooKeeper is somehow disrupted. I don't think it's an issue with ZooKeeper itself, as the same ensemble has supported a much bigger Kafka cluster for a few years.

1. The first kind is exactly this: https://github.com/apache/flink/pull/11338. Basically, temporary loss of connectivity or a rolling upgrade of ZooKeeper will cause the job to terminate. It eventually restarts from where it left off.
2. The second kind is when the job terminates and restarts for the same reason but is unable to recover from a checkpoint. I think it's similar to this: https://issues.apache.org/jira/browse/FLINK-19154. If upgrading to 1.12.0 (from 1.11.2) will fix the second issue, then I'll upgrade.
3. The third kind is where the job repeatedly restarts because it is unable to establish a session with ZooKeeper. I don't know if reducing the session timeout will help here, but in this case I'm forced to disable ZooKeeper HA entirely, as the job cannot even restart.

I could create a JIRA ticket for discussing the ZooKeeper issue itself if you suggest, but the ZooKeeper and savepoint issues are related, as I'm not sure what will happen in each of the above cases.
Hi Vishal,

Thanks for the detailed description of the problems.

1. This is currently the intended behaviour of Flink. The reason is that if the system is no longer connected to ZooKeeper, then we cannot rule out that another process has taken over the leadership. FLINK-10052 has the goal of making this behaviour configurable, and we intend to include it in the next major release.
2. This is indeed a bug in the newly introduced application mode. It should be fixed with Flink 1.11.3 or 1.12.0. Hence, I would recommend upgrading your Flink cluster.
3. Hard to tell what the problem is here. From Flink's perspective, if it cannot establish a connection to ZooKeeper, then it cannot be sure who the leader is and whether it should start executing jobs. Maybe there is a problem with the connection to the ZooKeeper cluster from the nodes on which Flink runs. Decreasing the session timeout usually makes the connection less stable if it is a network issue.

Cheers,
Till

On Mon, Dec 21, 2020 at 3:53 PM vishalovercome <[hidden email]> wrote:
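On point 3, the session timeout mentioned above is a client-side Flink setting. A sketch of the relevant flink-conf.yaml keys is below; the values shown are only illustrative, not a recommendation from this thread:

# Illustrative ZooKeeper client timeout settings (values are placeholders, in milliseconds).
# A shorter session timeout detects leadership loss sooner but makes the
# connection more sensitive to transient network problems.
high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000
high-availability.zookeeper.client.retry-wait: 5000
high-availability.zookeeper.client.max-retry-attempts: 3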