Quick survey on checkpointing performance


Quick survey on checkpointing performance

Robin Cassan
Hi all,

We are currently experiencing long checkpointing times on S3 and are wondering how abnormal they are compared to other loads and setups. Could some of you share a few stats from your running architecture so we can compare?

Here are our stats:

Architecture: 28 TMs on Kubernetes, 4 slots per TM, local NVMe SSDs (r5d.2xlarge instances), RocksDB state backend, incremental checkpoints on Amazon S3 (without entropy injection), checkpoint interval of 1 hour (a rough config sketch of this setup follows below)

Typical state size for one checkpoint: 220 GB

Checkpointing duration (end to end): 58 minutes
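For readers who want to compare against a similar setup, here is a minimal sketch of how such a job is typically wired up with Flink's Java API (using the 1.10-era RocksDBStateBackend constructor); the S3 path and the placeholder pipeline are hypothetical, not the poster's actual job:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every hour with exactly-once semantics.
            env.enableCheckpointing(60 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

            // RocksDB state backend with incremental checkpoints written to S3.
            // "s3://my-bucket/checkpoints" is a placeholder path.
            env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));

            // Tiny placeholder pipeline so the sketch runs end to end.
            env.fromElements(1, 2, 3)
               .map(v -> v * 2)
               .print();

            env.execute("checkpoint-survey-sketch");
        }
    }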

We are surprised to see such a long duration to send 220 GB to S3. We observe no backpressure in our job, and the checkpointing duration is more or less the same for each subtask. We'd love to know whether this is a normal duration or not, so thanks a lot for your answers!

Cheers,
Robin

Re: Quick survey on checkpointing performance

Yun Tang
Hi Robin

First of all, did you get the state size from the web UI? If so, that figure is the incremental checkpoint size, not the actual full size [1]. Assuming you have only one RocksDB instance per slot, the incremental checkpoint size for each RocksDB instance is about 2011 MB, which is somewhat large for an incremental checkpoint.

If every operator only uploads about 2011 MB to S3, an overall time of 58 minutes is far too long. Would you please check the async phase in the checkpoint details of all tasks? The async time reflects the S3 write performance. I would guess your async time is not very large; the most common reason for a large end-to-end duration is that an operator receives the checkpoint barrier late. It would help if you could share the checkpoint details from the web UI for further investigation.
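To make the arithmetic behind the 2011 MB figure explicit, here is a quick back-of-envelope sketch in Java; the one-RocksDB-instance-per-slot assumption and the 50 MB/s sustained upload rate are illustrative guesses, not measured values:

    public class CheckpointBackOfEnvelope {
        public static void main(String[] args) {
            double totalCheckpointMb = 220 * 1024;      // 220 GB incremental checkpoint, in MB
            int taskManagers = 28;
            int slotsPerTaskManager = 4;

            // Assumption: one RocksDB instance per slot.
            int rocksDbInstances = taskManagers * slotsPerTaskManager;      // 112
            double perInstanceMb = totalCheckpointMb / rocksDbInstances;    // ~2011 MB

            // Hypothetical sustained upload rate of 50 MB/s per instance to S3.
            double assumedMbPerSecond = 50.0;
            double expectedSeconds = perInstanceMb / assumedMbPerSecond;    // ~40 s, nowhere near 58 min

            System.out.printf("Per-instance upload: %.0f MB, expected time: %.0f s%n",
                    perInstanceMb, expectedSeconds);
        }
    }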



Best
Yun Tang

Re: Quick survey on checkpointing performance

Robin Cassan
Hi Yun,

Thank you for answering! For clarification, the 220 GB figure was indeed the size of one incremental checkpoint across all subtasks, as shown in the web UI. Since this operator has 122 partitions, each RocksDB instance would be sending about 1.8 GB to S3, if I'm not mistaken, which I agree should take less than one hour...
We do not really see delays in the alignment phase; most of the time is spent in the async phase, as shown in the screenshot below. Does the async phase only include sending sst files to S3, or does it also cover other operations, such as RocksDB flushing or merging?

Here is an example of a long async duration:
[attachment: Screenshot 2020-04-16 at 01.23.38.png]

The longest subtask is number 46:
[attachment: Screenshot 2020-04-16 at 01.24.00.png]

Do you think our bottleneck could be network or disk throughput?

Thanks again!
Robin

Re: Quick survey on checkpointing performance

Yun Tang
Hi Robin

The async phase only includes uploading sst files to S3; the time RocksDB spends flushing to local disk is included in the sync phase. From my point of view, uploading only about 1 GB of files in 20 minutes is really slow. Could you check your S3 metrics for network and disk throughput? The taskmanager logs might also contain warnings if writes to S3 are slow due to timeout exceptions.
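Putting a number on "really slow", here is a small sketch of the throughput implied by the figures above; the 20-minute async time is read off the screenshots, and the config key mentioned in the comment is an assumption to verify against your Flink version, not a confirmed fix:

    public class ImpliedThroughput {
        public static void main(String[] args) {
            double uploadedMb = 1.8 * 1024;             // ~1.8 GB per RocksDB instance
            double asyncMinutes = 20;                   // async duration seen in the UI
            double mbPerSecond = uploadedMb / (asyncMinutes * 60);

            // ~1.5 MB/s, far below what a single S3 upload stream normally sustains,
            // which points at something other than raw S3 bandwidth.
            System.out.printf("Implied upload rate: %.1f MB/s%n", mbPerSecond);

            // If uploads turn out to be single-threaded, one knob worth checking
            // (assumption, Flink >= 1.9) is:
            //   state.backend.rocksdb.checkpoint.transfer.thread.num  (default: 1)
        }
    }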


Best
Yun Tang
