Checkpoint Space Usage Debugging

Kent Murra
I'm looking into a situation where our checkpoint sizes keep growing over time.  I'm unable to pinpoint exactly why this is happening, and it would be great if there were a way to figure out how much checkpoint space is attributable to each operator so I can narrow it down.  Are there any tools or methods for introspecting the checkpoint data so that I can determine where the space is going?

The pipeline in question is consuming from Kinesis and batching up data using windows.  I suspected that I was doing something wrong with windowing, but my trigger returns FIRE_AND_PURGE and I'm also setting a max end timestamp.  The Kinesis consumer is not emitting watermarks at the moment, but as far as I know that's not necessary for proper checkpointing (only for exactly-once behavior).

Re: Checkpoint Space Usage Debugging

Yun Tang
Hi Kent

You can view the checkpoint details in the web UI to see how much checkpointed data each operator uploaded, and you can compare the state sizes over time to see whether each operator's checkpointed data stays within a stable range.
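The same comparison can be scripted instead of read off the UI. The sketch below is only an illustration. It assumes the per-checkpoint details have already been fetched as JSON (for example from the REST endpoint /jobs/<jobid>/checkpoints/details/<checkpointid>, which backs the web UI) and that each entry under "tasks" carries a "state_size" field in bytes; those field names may differ between Flink versions, and the sample data and task names are made up.

```python
from typing import Dict, List, Tuple


def task_state_sizes(details: dict) -> Dict[str, int]:
    """Map each task id to its state size in bytes for one checkpoint.

    Assumes the (version-dependent) layout {"tasks": {id: {"state_size": n}}}.
    """
    return {
        task_id: task["state_size"]
        for task_id, task in details.get("tasks", {}).items()
    }


def growth_by_task(older: dict, newer: dict) -> List[Tuple[str, int]]:
    """Return (task_id, bytes_grown) pairs, largest growth first."""
    before = task_state_sizes(older)
    after = task_state_sizes(newer)
    # Tasks missing from the older checkpoint are treated as starting at 0.
    growth = [
        (task_id, size - before.get(task_id, 0))
        for task_id, size in after.items()
    ]
    return sorted(growth, key=lambda pair: pair[1], reverse=True)


# Hypothetical details for two checkpoints of the same job.
older = {"id": 10, "tasks": {"source": {"state_size": 1_000},
                             "window": {"state_size": 50_000}}}
newer = {"id": 20, "tasks": {"source": {"state_size": 1_000},
                             "window": {"state_size": 90_000}}}

for task_id, grown in growth_by_task(older, newer):
    print(f"{task_id}: {grown:+d} bytes")
```

A task whose growth stays positive across many checkpoint pairs is the first place to look for state that is never cleaned up.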

Best
Yun Tang

From: Kent Murra <[hidden email]>
Sent: Saturday, April 18, 2020 1:47
To: [hidden email] <[hidden email]>
Subject: Checkpoint Space Usage Debugging
 