IO benchmarking

IO benchmarking

deepthi Sridharan
Hi,

I am trying to set up some benchmarking with a couple of IO options for saving checkpoints and have a few questions:

1. Does Flink come with any IO benchmarking tools? I couldn't find any. I was hoping to use those to derive some insights about the storage performance and extrapolate them to the checkpoint use case.

2. Are there any metrics pertaining to restore from checkpoints? The only metric I can find is the last restore time, but neither the time it took to read the checkpoints nor the time it took to restore the operator/task states seems to be covered. I am using RocksDB, but couldn't find any metrics on how long it took to restore the RocksDB state backend either.

3. I am trying to find documentation on how the states from multiple operators and tasks are serialized into the checkpoint files, so I can tailor the testing use case, but can't seem to find any. Are there any blogs that go into this detail, or would reading the code be the only option?

--
Thanks,
Deepthi

Re: IO benchmarking

Matthias
Hi Deepthi,
1. Have you had a look at flink-benchmarks [1]? I haven't used it but it might be helpful.
2. Unfortunately, Flink doesn't provide metrics like that. But you might want to follow FLINK-21736 [2] for future developments.
3. Is there anything specific you are looking for? Unfortunately, I don't know of any blogs with a more detailed summary. If you plan to look into the code, CheckpointCoordinator [3] might be a starting point. Alternatively, something like MetadataV2V3SerializerBase [4] offers insights into how the checkpoints' metadata is serialized.
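
Regarding 1., if a rough storage-level baseline is enough before digging into flink-benchmarks, a small script can time sequential writes and reads against each candidate checkpoint directory. The following is only a minimal sketch in Python and is not part of Flink or flink-benchmarks: the function name `benchmark_dir`, the 64 MiB file size and the 1 MiB chunk size are all assumptions chosen for illustration, and it only works for storage that is mounted as a local filesystem path.

```python
import os
import tempfile
import time


def benchmark_dir(directory, total_bytes=64 * 1024 * 1024, chunk_bytes=1024 * 1024):
    """Write, then read back, one temporary file in `directory`.

    Hypothetical helper for illustration; returns throughput in MiB/s.
    """
    chunk = os.urandom(chunk_bytes)  # random payload, so compression does not help
    n_chunks = total_bytes // chunk_bytes
    fd, path = tempfile.mkstemp(dir=directory, suffix=".bench")
    try:
        # Timed sequential write, including the flush to stable storage.
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        write_s = time.perf_counter() - start

        # Timed sequential read of the same file.
        start = time.perf_counter()
        bytes_read = 0
        with open(path, "rb") as f:
            while True:
                data = f.read(chunk_bytes)
                if not data:
                    break
                bytes_read += len(data)
        read_s = time.perf_counter() - start
    finally:
        os.remove(path)  # never leave the benchmark file behind

    mib = n_chunks * chunk_bytes / (1024 * 1024)
    return {
        "bytes_written": n_chunks * chunk_bytes,
        "bytes_read": bytes_read,
        "write_mib_per_s": mib / write_s,
        "read_mib_per_s": mib / read_s,
    }


if __name__ == "__main__":
    # Assumption: replace the temporary directory with each mount to compare.
    with tempfile.TemporaryDirectory() as d:
        print(benchmark_dir(d, total_bytes=8 * 1024 * 1024))
```

Keep in mind that the read pass will usually be served from the OS page cache right after the write, so the read number is an upper bound unless the cache is dropped or the file is much larger than memory. A single sequential file also does not reproduce how a state backend lays out its checkpoint files, so the results are at best a baseline for comparing storage options.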

Best,
Matthias


In reply to deepthi Sridharan's message of Tue, Mar 30, 2021 at 8:37 PM.

Re: IO benchmarking

deepthi Sridharan
Thanks, Matthias. This is very helpful. 

Regarding the checkpoint documentation, I was mostly looking for information on how states from various tasks get serialized into one (or more?) files on persistent storage. I'll check out the code pointers! 

--
Regards,
Deepthi

In reply to Matthias Pohl's message of Wed, Mar 31, 2021 at 7:07 AM.

Re: IO benchmarking

Matthias
For 2., there are also efforts to expose state and operator initialization through the logs (see FLINK-17012 [1]).
For 3., the TypeSerializer [2] might be another point of interest; it is used to serialize specific types. Other than that, state serialization depends heavily on the state backend in use. Hence, if you rely on RocksDB as your state backend, you may want to look into its SSTables.


In reply to deepthi Sridharan's message of Thu, Apr 1, 2021 at 1:27 AM.