Hi all,
We are trying to benchmark savepoint size vs. restore time. One thing we've observed is that when we reduce the number of task managers, the time to restore from a savepoint increases drastically:

1/ Restoring a 9.7 TB savepoint onto 156 task managers takes 28 minutes.
2/ Restoring the same savepoint onto 30 task managers takes over 3 hours.

Is this expected? How does the restore process work? Is this just a matter of having lower restore parallelism with 30 task managers vs. 156?

Some details:
- Running on Kubernetes
- Using RocksDB with a local SSD for the state backend
- The savepoint is hosted on GCS
- The smaller task manager case is important to us because we expect to deploy our application with a high number of task managers and downscale once a backfill is completed

Differences between 1/ and 2/:
- 2/ has a decreased task manager count (156 -> 30)
- 2/ has operator parallelism decreased by a factor of ~10
- 2/ uses a striped SSD (3 SSDs mounted as a single logical volume) to hold the RocksDB files

Thanks in advance for your help!
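For reference, a minimal sketch of how this kind of setup might be configured with the (pre-1.13) RocksDBStateBackend API. The bucket name, local SSD path, and class name below are placeholders, not the actual values from our deployment, and writing checkpoints/savepoints to GCS assumes the GCS filesystem plugin (or the Hadoop gcs-connector) is available to Flink:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendSetupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoints/savepoints written to GCS (placeholder URI);
            // incremental checkpointing enabled.
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("gs://my-bucket/checkpoints", true);

            // Working state (RocksDB files) kept on the TaskManager's local SSD.
            backend.setDbStoragePath("/mnt/local-ssd/rocksdb");

            env.setStateBackend(backend);

            // ... job graph definition ...

            env.execute("savepoint-restore-benchmark");
        }
    }

The restore itself is triggered at submission time, e.g. flink run -s <savepointPath> ...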
Hi Kevin,

When decreasing the TaskManager count, I assume you also decrease the parallelism of the Flink job. There are three aspects that can then cause a slower recovery:

1) Each Task is assigned a larger key range, so each TaskManager has to download more data in order to restart its Tasks. Moreover, there are fewer nodes downloading larger portions of the data (less parallelization).

2) If you rescaled the parallelism, a Task can be assigned a key range that requires downloading multiple key-range parts from the previous run/savepoint. The new key range might not need all the data in those parts, so some of the downloaded data is never used in the end.

3) When rescaling the job, Flink has to rebuild the RocksDB instance, which is an expensive and slow operation: for every savepoint part needed for its key range, Flink opens a RocksDB instance and then extracts only the data relevant to its key range into a new RocksDB instance. This causes a lot of read and write amplification.

Cheers,
Till

On Wed, Apr 7, 2021 at 4:07 PM Kevin Lam <[hidden email]> wrote:
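To make points 1) and 2) concrete, here is a small standalone sketch that, as far as I understand it, mirrors the key-group-to-subtask assignment formulas in Flink's KeyGroupRangeAssignment. The maxParallelism of 4096 is a hypothetical value chosen only for illustration:

    // Standalone sketch of Flink's key-group-to-subtask assignment
    // (mirrors org.apache.flink.runtime.state.KeyGroupRangeAssignment).
    public class KeyGroupRescaleSketch {

        // Range of key groups [start, end] owned by one subtask.
        static int[] rangeForSubtask(int maxParallelism, int parallelism, int subtaskIndex) {
            int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
            int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
            return new int[] {start, end};
        }

        public static void main(String[] args) {
            int maxParallelism = 4096; // hypothetical setting

            // Before downscaling: 156 subtasks, each owns a small slice of key groups.
            int[] before = rangeForSubtask(maxParallelism, 156, 0);
            // After downscaling: 30 subtasks, each owns a much larger slice.
            int[] after = rangeForSubtask(maxParallelism, 30, 0);

            System.out.printf("p=156, subtask 0 owns key groups [%d, %d] (%d groups)%n",
                    before[0], before[1], before[1] - before[0] + 1);
            System.out.printf("p=30,  subtask 0 owns key groups [%d, %d] (%d groups)%n",
                    after[0], after[1], after[1] - after[0] + 1);

            // Point 2: the new subtask's range overlaps the ranges of several
            // old subtasks, so it must read state from several savepoint parts.
            int overlapping = 0;
            for (int old = 0; old < 156; old++) {
                int[] r = rangeForSubtask(maxParallelism, 156, old);
                if (r[0] <= after[1] && r[1] >= after[0]) {
                    overlapping++;
                }
            }
            System.out.printf("p=30 subtask 0 overlaps %d of the 156 old key-group ranges%n",
                    overlapping);
        }
    }

With these example numbers, a subtask at parallelism 30 owns roughly five times as many key groups as one at parallelism 156, and its range overlaps several of the old ranges, so it has to pull (and partially discard) state from several savepoint parts.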
That's really helpful, thanks Till!

On Thu, Apr 8, 2021 at 6:32 AM Till Rohrmann <[hidden email]> wrote: