Restoring from a savepoint, constraining factors

Restoring from a savepoint, constraining factors

Rex Fenley
Hello,

We have a savepoint that's ~0.5 TiB in size. When we try to restore from it, we time out because it takes too long (right now checkpoint timeouts are set to 2 hours, which is already well above where we want them).
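A minimal sketch of where such a checkpoint timeout is typically set (assuming the DataStream API; the interval value below is illustrative, not our production setting):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTimeoutSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Trigger a checkpoint every 60 seconds (illustrative value).
            env.enableCheckpointing(60_000L);

            // Abort any checkpoint that takes longer than 2 hours (the
            // timeout mentioned above).
            env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 60 * 1000L);
        }
    }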

I'm curious whether Flink needs to download the entire savepoint before the job can continue. Also, for my own education, what are all the operations that take place before a job is restored from a savepoint?

Additionally, the network seems to be a big bottleneck. Our network should be operating in the GiB/s range per instance, but it seems to operate at 70-100 MiB/s when retrieving a savepoint from S3. Are there any constraining factors in Flink's design that would slow down the network download of a savepoint this much?
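One possibly relevant knob (an assumption on my part, not verified for our setup): with the RocksDB backend, state file transfers use a small thread pool, so a low thread count can cap S3 throughput per instance far below the NIC's capacity. A sketch of raising it, assuming a Flink version that accepts a Configuration when creating the environment (otherwise the same key goes in flink-conf.yaml):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TransferThreadsSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Threads used to download/upload RocksDB state files during
            // restore and checkpointing; the default is small, which can
            // serialize transfers from S3.
            conf.setString("state.backend.rocksdb.checkpoint.transfer.thread.num", "4");
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
        }
    }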

Thanks!

--
Rex Fenley | Software Engineer - Mobile and Backend
Remind.com

Re: Restoring from a savepoint, constraining factors

Rex Fenley
After looking through lots of graphs and AWS limits, I've come to the conclusion that we're hitting limits on our disk writes. I'm guessing this is backpressuring the entire restore process. I'm still very curious about all the steps involved in savepoint restoration, though!

--
Rex Fenley | Software Engineer - Mobile and Backend
Remind.com

Re: Restoring from a savepoint, constraining factors

Piotr Nowojski
Hi Rex,

Good that you have found the source of your problem and thanks for reporting it back.

Regarding your question about the recovery steps (ignoring scheduling and deploying): I think it depends on the state backend used. From your other emails I see you are using RocksDB, so I believe this is the big picture of how it works in the RocksDB case:

1. Relevant state files are downloaded from the DFS to the local disks of a TaskManager [1] (see the sketch after this list).
2. I presume RocksDB needs to load at least some metadata from those local files in order to finish the recovery process (I doubt it, but maybe it also needs to load some actual data).
3. Tasks can start processing records.
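
A sketch related to step 1 (assuming Flink's standard configuration options; note this helps checkpoint recovery, not savepoint restores): task-local recovery keeps a secondary copy of checkpoint state on the TaskManager's local disk, so the DFS download can be skipped when a task is rescheduled onto the same machine:

    import org.apache.flink.configuration.CheckpointingOptions;
    import org.apache.flink.configuration.Configuration;

    public class LocalRecoverySketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Keep a secondary, local copy of keyed state alongside the
            // DFS copy; tasks recovered on the same TaskManager then skip
            // the download in step 1 entirely.
            conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, true);
        }
    }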

Best,
Piotrek

Re: Restoring from a savepoint, constraining factors

Rex Fenley
This is great information, thank you. It does look like task-local recovery only works for checkpoints; however, we do want to bring down our recovery times, so this is useful.

I'm still wondering what all is involved with savepoints, though. Savepoint restoration must have some repartitioning step, I'd imagine, which seems like it could be fairly involved? Anything else I'm missing?

Thanks!

--
Rex Fenley | Software Engineer - Mobile and Backend
Remind.com

Re: Restoring from a savepoint, constraining factors

Piotr Nowojski
Hi,

Savepoints internally work in exactly the same way as checkpoints.

I'm not sure what you are referring to as a repartitioning step. If you mean rescaling (changing the parallelism), then this can happen for both checkpoints and savepoints. You can find answers to a lot of such questions by searching online; you can read about rescaling here [1].
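To illustrate what rescaling moves around, here is a simplified sketch of Flink's key-group assignment (adapted from org.apache.flink.runtime.state.KeyGroupRangeAssignment; simplified for illustration): every key hashes to one of maxParallelism key groups, and rescaling reassigns contiguous ranges of key groups, not individual keys, to the new operator instances:

    import org.apache.flink.util.MathUtils;

    public class KeyGroupSketch {
        // A key's key group depends only on the max parallelism, so it
        // stays stable across rescalings.
        static int assignToKeyGroup(Object key, int maxParallelism) {
            return MathUtils.murmurHash(key.hashCode()) % maxParallelism;
        }

        // Each operator instance owns a contiguous range of key groups;
        // rescaling re-slices these ranges across the new parallelism.
        static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
            return keyGroupId * parallelism / maxParallelism;
        }

        public static void main(String[] args) {
            // With maxParallelism = 128, going from 4 to 8 subtasks splits
            // each subtask's 32 key groups between two new subtasks.
            int kg = assignToKeyGroup("user-42", 128);
            System.out.println("key group " + kg + " -> subtask "
                    + operatorIndexForKeyGroup(128, 8, kg));
        }
    }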

Piotrek
