(DEPRECATED) Apache Flink User Mailing List archive.

Savepoint/checkpoint confusion

Igor Basov

Savepoint/checkpoint confusion

Hello,

I'm confused about the use of savepoints and checkpoints in different scenarios.

I understand that the main purpose of checkpoints is fault tolerance. They are more lightweight and don't support changing the job graph, parallelism, or state backend when restoring from them, as mentioned in the latest 1.13 docs:

https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/state/checkpoints/#difference-to-savepoints

At the same time:

1) Reactive scaling mode (in 1.13) uses checkpoints exactly for that - rescaling.

2) There are use cases like here:

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/What-happens-when-a-job-is-rescaled-td39462.html

where people seem to be using retained checkpoints instead of savepoints to do manual job restarts with rescaling.

3) There are claims like here:

https://lists.apache.org/thread.html/4299518f4da2810aa88fe6b21f841880b619f3f8ac264084a318c034%40%3Cuser.flink.apache.org%3E

that in an HA setup the JobManager is able to restart from a checkpoint even if operators are added/removed or parallelism is changed (in this case I'm not sure if the checkpoints used by the HA JM in `high-availability.storageDir` are the same thing as usual checkpoints).

So I guess the questions are:

1) Can retained checkpoints be safely used for manually restarting and rescaling a job?

2) Are checkpoints made by the HA JM structurally different from the usual ones? Can they be used to restore a job with a changed job graph?

Thank you,

Igor

rmetzger0

Re: Savepoint/checkpoint confusion

Hey Igor,

1) Yes, reactive mode indeed does the same.

2) No, HA mode only stores some metadata in ZooKeeper about the leadership and the latest checkpoints, but the checkpoints themselves are the same. They should be usable for a changed job graph (if the state matches the operators, which you ensure by setting the UUIDs [1]).

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/production_ready/#set-uuids-for-all-operators
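
To illustrate why stable operator IDs matter here, the following is a minimal Python sketch of matching snapshot state to a changed job graph by ID. It is a simplified model, not Flink code: the function `restore_by_uid`, its parameters, and the operator names are hypothetical. In a real job the IDs are set with `.uid("...")` on each operator, and the `allow_non_restored_state` parameter only mimics the `--allowNonRestoredState` option of `flink run`.

```python
from typing import Any, Dict, List, Optional, Tuple


def restore_by_uid(
    snapshot: Dict[str, Any],
    new_operator_uids: List[str],
    allow_non_restored_state: bool = False,
) -> Tuple[Dict[str, Optional[Any]], List[str]]:
    """Map snapshot state onto the operators of a (possibly changed) job graph.

    snapshot maps an operator ID to the state stored for that operator.
    Operators that are new in the job graph start with no state (None).
    State whose operator was removed is "orphaned" and is only tolerated
    when allow_non_restored_state is True.
    """
    orphaned = sorted(uid for uid in snapshot if uid not in new_operator_uids)
    if orphaned and not allow_non_restored_state:
        raise ValueError(f"state cannot be mapped to any operator: {orphaned}")
    restored = {uid: snapshot.get(uid) for uid in new_operator_uids}
    return restored, orphaned


if __name__ == "__main__":
    snapshot = {"kafka-source": {"offset": 42}, "counter": {"a": 3}}
    # The new job graph adds an operator; existing state is matched by ID.
    restored, orphaned = restore_by_uid(
        snapshot, ["kafka-source", "counter", "new-filter"]
    )
    print(restored, orphaned)
```

If the IDs were auto-generated instead, a change to the job graph could change them, and the state in the snapshot would no longer line up with the operators in this lookup.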

On Fri, May 7, 2021 at 10:13 PM Igor Basov <[hidden email]> wrote:

Igor Basov

Re: Savepoint/checkpoint confusion

Hey Robert,

Thanks for the answer! But then I guess the only difference between savepoints and checkpoints is that checkpoints are structurally state dependent and can be incremental, but otherwise they are functionally equivalent. So functionally, a savepoint can be considered a full checkpoint which provides two additional benefits: it's made on demand and the state backend can be changed (since 1.13). Is this correct?

On Thu, 20 May 2021 at 05:35, Robert Metzger <[hidden email]> wrote:

rmetzger0

Re: Savepoint/checkpoint confusion

Hi Igor,

In my understanding, checkpoints are managed by the system (Flink decides when to create and delete them), while savepoints are managed by the user (they decide when to create and delete them).

Indeed, only checkpoints can be incremental (if that feature is enabled).
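
As a rough illustration of what "incremental" means, here is a small Python sketch. It assumes the RocksDB state backend, where state is persisted as immutable files, and models each checkpoint as a set of file names. The function `incremental_upload` and the file names are hypothetical and not part of any Flink API.

```python
from typing import List, Set, Tuple


def incremental_upload(
    previous_files: Set[str], current_files: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split the files of a new checkpoint into uploaded and reused ones.

    Files already present in the previous checkpoint are only referenced
    again; files that are new since then have to be uploaded.
    """
    to_upload = sorted(current_files - previous_files)
    reused = sorted(current_files & previous_files)
    return to_upload, reused


if __name__ == "__main__":
    cp1 = {"sst-1", "sst-2"}
    cp2 = {"sst-1", "sst-2", "sst-3"}
    print(incremental_upload(set(), cp1))  # first checkpoint uploads everything
    print(incremental_upload(cp1, cp2))    # second one uploads only the new file
```

In this model a savepoint corresponds to calling the function with an empty `previous_files` set every time: everything is written out, so the result does not depend on earlier snapshots.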

> it's made on-demand and the state backend can be changed (since 1.13). Is this correct?

Yes

On Thu, May 20, 2021 at 4:46 PM Igor Basov <[hidden email]> wrote:
