Reducing Task Manager Count Greatly Increases Savepoint Restore

Reducing Task Manager Count Greatly Increases Savepoint Restore

Kevin Lam
Hi all,

We are trying to benchmark savepoint size vs. restore time. 

One thing we've observed is that when we reduce the number of task managers, the time to restore from a savepoint increases drastically:

1/ Restoring from a 9.7 TB savepoint onto 156 task managers takes 28 minutes
2/ Restoring from the same savepoint onto 30 task managers takes over 3 hours

Is this expected? How does the restore process work? Is this just a matter of having lower restore parallelism for 30 task managers vs 156 task managers? 

Some details:

- Running on Kubernetes
- Using RocksDB with a local SSD for the state backend
- Savepoint is hosted on GCS
- The smaller task manager case is important to us because we expect to deploy our application with a high number of task managers, and downscale once a backfill is completed 

Differences between 1/ and 2/:

- 2/ has a decreased task manager count: 156 -> 30
- 2/ has decreased operator parallelism by a factor of ~10
- 2/ uses a striped SSD (3 SSDs mounted as a single logical volume) to hold the RocksDB files

Thanks in advance for your help!

Re: Reducing Task Manager Count Greatly Increases Savepoint Restore

Till Rohrmann
Hi Kevin,

When decreasing the TaskManager count, I assume that you also decrease the parallelism of the Flink job. There are three aspects which can then cause a slower recovery.

1) Each Task gets a larger key range assigned. Therefore, each TaskManager has to download more data in order to restart its Tasks. Moreover, there are fewer nodes downloading larger portions of the data (less parallelization); see the first sketch below.
2) If you rescaled the parallelism, it can happen that a Task is assigned a key range which requires downloading multiple key range parts from the previous run/savepoint. The new key range might not need all the data contained in those parts, so some of the downloaded data is never actually used.
3) When rescaling the job, Flink has to rebuild the RocksDB instances, which is an expensive and slow operation: for every savepoint part that it needs for its key range, Flink creates a RocksDB instance and then extracts only the portion relevant to its key range into a new RocksDB instance. This causes a lot of read and write amplification (see the second sketch below).
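
To make the key-range effect in 1) concrete, here is a minimal, standalone sketch (not Flink's own classes) intended to mirror the key-group range assignment: each subtask owns a contiguous range of key groups out of maxParallelism many, so a lower parallelism means a proportionally larger range, and hence more savepoint data, per subtask. The maxParallelism of 4096 is just an assumed example; substitute your job's actual setting.

public class KeyGroupRangeSketch {

    // Key groups [start, end] owned by one subtask, intended to mirror the integer
    // arithmetic in Flink's KeyGroupRangeAssignment.computeKeyGroupRangeForOperatorIndex.
    static int[] rangeForSubtask(int maxParallelism, int parallelism, int subtaskIndex) {
        int start = (subtaskIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((subtaskIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int maxParallelism = 4096; // assumed example value, not necessarily your job's setting
        for (int parallelism : new int[] {156, 30}) {
            int[] r = rangeForSubtask(maxParallelism, parallelism, 0);
            System.out.printf("parallelism=%d -> subtask 0 owns key groups [%d, %d] (%d of %d)%n",
                    parallelism, r[0], r[1], r[1] - r[0] + 1, maxParallelism);
        }
    }
}

Running this prints that subtask 0 owns 27 key groups at parallelism 156 but 137 key groups at parallelism 30, i.e. roughly 5x as much state to download per task.

And for 3), a very rough illustration of the shape of the rebuild step: every key in every overlapping savepoint part is read once, and only the keys that fall into the new range are written again into a fresh instance, which is where the read and write amplification comes from. The directory layout, the 2-byte key-group prefix, and the helper names are assumptions for illustration; this is not Flink's actual restore code.

import java.util.List;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

public class RescaleRestoreSketch {

    static void restore(List<String> downloadedPartDirs, String targetDir,
                        int startKeyGroup, int endKeyGroup) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB target = RocksDB.open(opts, targetDir)) {
            for (String partDir : downloadedPartDirs) {
                try (Options readOpts = new Options();
                     RocksDB part = RocksDB.openReadOnly(readOpts, partDir); // temporary instance per savepoint part
                     RocksIterator it = part.newIterator()) {
                    for (it.seekToFirst(); it.isValid(); it.next()) {        // full scan of the part (read amplification)
                        int keyGroup = keyGroupOf(it.key());
                        if (keyGroup >= startKeyGroup && keyGroup <= endKeyGroup) {
                            target.put(it.key(), it.value());                // rewrite only the keys this subtask keeps
                        }
                    }
                }
            }
        }
    }

    // Assumed 2-byte big-endian key-group prefix; Flink actually uses 1 or 2 bytes
    // depending on the configured max parallelism.
    static int keyGroupOf(byte[] key) {
        return ((key[0] & 0xFF) << 8) | (key[1] & 0xFF);
    }
}

With the same savepoint, the 30 task manager case therefore downloads more data per node and does more of this rebuild work per node, which fits the large gap between 28 minutes and 3+ hours that you are seeing.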

Cheers,
Till


Re: Reducing Task Manager Count Greatly Increases Savepoint Restore

Kevin Lam
That's really helpful, thanks Till! 
