(DEPRECATED) Apache Flink User Mailing List archive.

IO benchmarking

deepthi Sridharan

IO benchmarking

Hi,

I am trying to set up some benchmarking with a couple of IO options for saving checkpoints and have a few questions:

1. Does Flink come with any IO benchmarking tools? I couldn't find any. I was hoping to use them to gain some insight into the storage performance and extrapolate from that to the checkpoint use case.

2. Are there any metrics pertaining to restoring from checkpoints? The only metric I can find is the last restore time; neither the time it took to read the checkpoint nor the time it took to restore the operator/task state seems to be covered. I am using RocksDB, but couldn't find any metrics for how long it took to restore the RocksDB state backend either.

3. I am trying to find documentation on how state from multiple operators and tasks is serialized into the checkpoint files, so that I can tailor the test cases, but can't seem to find any. Are there any blogs that go into this detail, or would reading the code be the only option?

Thanks,

Deepthi

Matthias

Re: IO benchmarking

Hi Deepthi,

1. Have you had a look at flink-benchmarks [1]? I haven't used it, but it might be helpful.

2. Unfortunately, Flink doesn't provide metrics like that. But you might want to follow FLINK-21736 [2] for future developments.

3. Is there anything specific you are looking for? Unfortunately, I don't know of any blogs with a more detailed summary. If you plan to look into the code, CheckpointCoordinator [3] might be a starting point. Alternatively, something like MetadataV2V3SerializerBase [4] offers insights into how the checkpoints' metadata is serialized.

Best,

Matthias

[1] https://github.com/apache/flink-benchmarks

[2] https://issues.apache.org/jira/browse/FLINK-21736

[3] https://github.com/apache/flink/blob/11550edbd4e1874634ec441bde4fe4952fc1ec4e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L1493

[4] https://github.com/apache/flink/blob/adaaed426c2e637b8e5ffa3f0d051326038d30aa/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/metadata/MetadataV2V3SerializerBase.java#L83
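
The thread does not say whether flink-benchmarks covers raw storage IO, so a first rough number for a candidate checkpoint directory can also be obtained with a plain JDK probe. The following is a minimal sketch under stated assumptions, not a Flink tool: the class name CheckpointDirProbe, the 64 MiB payload and the 1 MiB chunk size are hypothetical choices, and it only times one sequential write and one sequential read of a single file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper; not part of Flink or flink-benchmarks. */
class CheckpointDirProbe {

    /** Writes totalBytes in chunks, forces them to storage, and returns the elapsed nanoseconds. */
    static long timeSequentialWrite(Path file, long totalBytes, int chunkBytes) throws IOException {
        ByteBuffer chunk = ByteBuffer.allocate(chunkBytes);
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            long written = 0;
            while (written < totalBytes) {
                chunk.clear();
                chunk.limit((int) Math.min(chunkBytes, totalBytes - written));
                while (chunk.hasRemaining()) {
                    written += ch.write(chunk);
                }
            }
            ch.force(true); // include the flush to the device in the measurement
        }
        return System.nanoTime() - start;
    }

    /** Reads the whole file in chunks and returns the elapsed nanoseconds. */
    static long timeSequentialRead(Path file, int chunkBytes) throws IOException {
        ByteBuffer chunk = ByteBuffer.allocate(chunkBytes);
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (ch.read(chunk) != -1) {
                chunk.clear(); // discard the data; only the transfer time matters here
            }
        }
        return System.nanoTime() - start;
    }

    /** Converts a byte count and a duration in nanoseconds to MiB per second. */
    static double mibPerSecond(long bytes, long nanos) {
        return (bytes / (1024.0 * 1024.0)) / (nanos / 1e9);
    }

    public static void main(String[] args) throws IOException {
        // Directory under test; falls back to a temp directory if no argument is given.
        Path dir = args.length > 0 ? Paths.get(args[0]) : Files.createTempDirectory("ckpt-probe");
        Path file = dir.resolve("probe.bin");
        long total = 64L * 1024 * 1024;
        long writeNanos = timeSequentialWrite(file, total, 1 << 20);
        long readNanos = timeSequentialRead(file, 1 << 20);
        System.out.printf("write: %.1f MiB/s, read: %.1f MiB/s%n",
                mibPerSecond(total, writeNanos), mibPerSecond(total, readNanos));
        Files.deleteIfExists(file);
    }
}
```

Run it with the directory under test as the only argument. The read pass will usually be served from the OS page cache unless the file is larger than memory, and the probe works only on paths reachable through the local filesystem API, so it says nothing about Flink's own checkpoint write pattern or about object stores.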

In reply to deepthi Sridharan's message of Tue, Mar 30, 2021, 8:37 PM.

deepthi Sridharan

Re: IO benchmarking

Thanks, Matthias. This is very helpful.

Regarding the checkpoint documentation, I was mostly looking for information on how states from various tasks get serialized into one (or more?) files on persistent storage. I'll check out the code pointers!

In reply to Matthias Pohl's message of Wed, Mar 31, 2021, 7:07 AM.

Regards,

Deepthi

Matthias

Re: IO benchmarking

For 2. there are also efforts to expose the state and operator initialization through the logs (see FLINK-17012 [1]).

For 3., the TypeSerializer [2] might be another point of interest. It is used to serialize specific types. Other than that, the state serialization depends heavily on the state backend in use. Hence, you may want to look into RocksDB's SSTables if you rely on RocksDB as the state backend.

[1] https://issues.apache.org/jira/browse/FLINK-17012

[2] https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java
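
To make the per-type serialization idea concrete: the linked TypeSerializer class pairs a serialize method that writes a value to a data output with a deserialize method that reads it back from a data input. The sketch below mirrors only that shape using java.io's DataOutput and DataInput. It is an analogy under stated assumptions, not Flink's API: SketchSerializer, WordCount, WordCountSerializer and the byte layout are all hypothetical, and the bytes a real state backend stores may look quite different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a serialize/deserialize pair; not Flink's TypeSerializer. */
interface SketchSerializer<T> {
    void serialize(T value, DataOutput target) throws IOException;
    T deserialize(DataInput source) throws IOException;
}

/** Hypothetical state value used only for illustration. */
final class WordCount {
    final String word;
    final long count;

    WordCount(String word, long count) {
        this.word = word;
        this.count = count;
    }
}

/** Assumed layout: 4-byte length, UTF-8 bytes of the word, then an 8-byte count. */
final class WordCountSerializer implements SketchSerializer<WordCount> {
    @Override
    public void serialize(WordCount value, DataOutput target) throws IOException {
        byte[] utf8 = value.word.getBytes(StandardCharsets.UTF_8);
        target.writeInt(utf8.length);
        target.write(utf8);
        target.writeLong(value.count);
    }

    @Override
    public WordCount deserialize(DataInput source) throws IOException {
        byte[] utf8 = new byte[source.readInt()];
        source.readFully(utf8);
        long count = source.readLong();
        return new WordCount(new String(utf8, StandardCharsets.UTF_8), count);
    }
}

class SerializerSketch {
    /** Serializes one value into a fresh byte array. */
    static byte[] toBytes(WordCount value) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            new WordCountSerializer().serialize(value, out);
        }
        return buffer.toByteArray();
    }

    /** Reads one value back from a byte array produced by toBytes. */
    static WordCount fromBytes(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return new WordCountSerializer().deserialize(in);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = toBytes(new WordCount("flink", 42L));
        WordCount copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes -> " + copy.word + "=" + copy.count);
    }
}
```

For benchmarking purposes, a sketch like this is mainly useful for estimating how many bytes a given state value contributes before any backend-specific encoding or compression is applied.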

In reply to deepthi Sridharan's message of Thu, Apr 1, 2021, 1:27 AM.