Hello,

We have a savepoint that's ~0.5 TiB in size. When we try to restore from it, we time out because it takes too long (right now checkpoint timeouts are set to 2 hours, which is already well above where we want them). Does Flink need to download the entire savepoint before it can continue? And, for my own education, what are all the operations that take place before a job is restored from a savepoint?

Additionally, the network seems to be a big bottleneck. Our network should be operating in the GiB/s range per instance, but it seems to operate between 70-100 MiB/s when retrieving a savepoint. Are there any constraining factors in Flink's design that would slow down the network download of a savepoint this much (from S3)?

Thanks!

-- Rex Fenley | Software Engineer - Mobile and Backend | Remind.com
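For context, the 2-hour timeout mentioned above presumably refers to Flink's checkpoint timeout, which bounds how long a checkpoint may stay in progress before it is aborted. A minimal sketch of how that setting is typically configured; the checkpoint interval and class name are illustrative assumptions, not taken from this thread:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTimeoutSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 10 minutes (illustrative interval, not from the thread).
            env.enableCheckpointing(10 * 60 * 1000L);

            // Abort any checkpoint that has not completed within 2 hours,
            // matching the timeout described above.
            env.getCheckpointConfig().setCheckpointTimeout(2 * 60 * 60 * 1000L);
        }
    }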
After looking through lots of graphs and AWS limits, I've come to the conclusion that we're hitting limits on our disk writes. I'm guessing this is backpressuring the entire restore process. I'm still very curious about all the steps involved in savepoint restoration, though!

On Fri, Jan 15, 2021 at 7:50 PM Rex Fenley <[hidden email]> wrote:
Hi Rex,

Good that you have found the source of your problem, and thanks for reporting back. Regarding your question about the recovery steps (ignoring scheduling and deploying), I think it depends on the state backend being used. From your other emails I see you are using RocksDB, so I believe this is the big picture of how it works in the RocksDB case:

1. The relevant state files are loaded from the DFS to the local disks of a TaskManager [1].
2. I presume RocksDB needs to load at least some metadata from those local files in order to finish the recovery process (I doubt it, but maybe it also needs to load some actual data).
3. The task can start processing records.

Best,
Piotrek

[1] This step can be avoided if you are using local recovery (see the sketch below): https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery

On Sat, Jan 16, 2021 at 06:15 Rex Fenley <[hidden email]> wrote:
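A minimal sketch of enabling the task-local recovery that footnote [1] points to. The programmatic form below is an assumption; the same flag can be set as state.backend.local-recovery: true in flink-conf.yaml, and the class and option names are from recent Flink versions rather than from this thread:

    import org.apache.flink.configuration.CheckpointingOptions;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalRecoverySketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Keep a secondary copy of checkpointed state on the TaskManager's local disk,
            // so that recovery (without rescaling) can restore from local files instead of
            // re-downloading everything from the DFS (S3 in this case).
            conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, true);

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);

            // ... build and execute the job as usual ...
        }
    }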
This is great information, thank you. It does look like task-local recovery only works for checkpoints, but we do want to bring down our recovery times, so this is useful. I'm still wondering what is involved with savepoints, though. I'd imagine savepoint restoration must have some repartitioning step, which seems like it could be fairly involved? Anything else I'm missing?

Thanks!

On Mon, Jan 18, 2021 at 2:49 AM Piotr Nowojski <[hidden email]> wrote:
Hi,

Savepoints internally work in exactly the same way as checkpoints. I'm not sure what you are referring to as a repartitioning step. If you mean rescaling (changing the parallelism), then this can happen for both checkpoints and savepoints. You can find answers to a lot of such questions by searching online; about rescaling you can read here [1].

Piotrek

On Mon, Jan 18, 2021 at 17:58 Rex Fenley <[hidden email]> wrote:
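For what it's worth, rescaling of keyed state on restore is governed by the maximum parallelism, which fixes the number of key groups the state is split into; on restore at a different parallelism, whole key-group ranges are reassigned to the new subtasks. A minimal sketch of the relevant settings; the numbers are illustrative assumptions, not from this thread:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RescalingSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // The max parallelism fixes the number of key groups keyed state is hashed into.
            // It cannot be changed later without losing state, so it is set generously up front.
            env.setMaxParallelism(720);   // illustrative value

            // The actual parallelism can be changed between restores, up to the max parallelism.
            env.setParallelism(12);       // illustrative value
        }
    }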