Checkpoint is timing out - inspecting state


Dan
Hi.

We're doing something bad with our Flink state.  We just launched a feature that creates very big values (lists of objects that we append to) in MapState.

Our checkpoints time out (after 10 minutes).  I'm assuming the values are too big.  Backpressure is okay, and CPU and memory metrics look okay.

Questions

1. Is there an easy tool for inspecting the Flink state?

I found this post about drilling into Flink state.  I was hoping for something more like a CLI.

2. Is there a way to break down the time spent during a checkpoint if it times out?

Thanks!
- Dan


Re: Checkpoint is timing out - inspecting state

Yun Gao
Hi Dan,

Flink already integrates a tool in the web UI for monitoring
detailed checkpoint statistics [1]. It shows the time consumed
in each phase and by each task, so it can be used to debug
the checkpoint timeout.

Best,
Yun
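
For anyone who wants the same statistics outside the browser: the numbers in the web UI are also served by the JobManager's REST API, which is closer to the CLI-style workflow asked about above. The following is only a sketch. The class and method names are made up for illustration, and the base address (something like http://localhost:8081) is an assumption about your setup. The three paths are the checkpoint statistics endpoints from the Flink REST API, which, as far as I know, break a checkpoint down per task and per subtask.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper: builds URLs for Flink's checkpoint statistics REST endpoints. */
final class CheckpointStatsEndpoints {
    // Assumed JobManager REST address, e.g. "http://localhost:8081".
    private final String base;

    CheckpointStatsEndpoints(String base) {
        // Drop a trailing slash so the paths below join cleanly.
        this.base = base.endsWith("/") ? base.substring(0, base.length() - 1) : base;
    }

    /** Counts, summary and recent history for all checkpoints of a job. */
    URI overview(String jobId) {
        return URI.create(base + "/jobs/" + jobId + "/checkpoints");
    }

    /** Statistics for one checkpoint, broken down by task. */
    URI details(String jobId, long checkpointId) {
        return URI.create(overview(jobId) + "/details/" + checkpointId);
    }

    /** Per-subtask statistics for one task (job vertex) within one checkpoint. */
    URI subtasks(String jobId, long checkpointId, String vertexId) {
        return URI.create(details(jobId, checkpointId) + "/subtasks/" + vertexId);
    }

    /** Plain GET that returns the JSON body as a string (needs Java 11+). */
    static String fetch(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The same URLs work with curl. The job ID and vertex ID have to come from your own cluster; nothing in this thread supplies them.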




------------------ Original Mail ------------------
Sender: Dan Hill <[hidden email]>
Send Date: Sat Jun 12 09:15:50 2021
Recipients: user <[hidden email]>
Subject: Checkpoint is timing out - inspecting state

Re: Checkpoint is timing out - inspecting state

Dan
Hi Yun.  The UI was not useful for this case.  I had a feeling beforehand about what the issue was.  We refactored the state and now the checkpoint is 10x faster.
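
The thread does not say what the refactoring was, so the following is only a guess at the kind of effect involved, not Dan's actual change. It is a plain-Java sketch with no Flink dependency and hypothetical names. It assumes a state backend that stores values in serialized form (RocksDB, for example), where appending to a list held as a single map value means rewriting the whole list on every update. The sketch counts the bytes written for that layout against a layout with one entry per element.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of write volume for two ways of storing an append-only list. */
final class AppendCostSketch {

    /** Stand-in serializer: 4-byte element count followed by each element via writeUTF. */
    static byte[] serializeList(List<String> items) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(items.size());
            for (String item : items) {
                out.writeUTF(item);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Stand-in serializer for a single element stored as its own entry. */
    static byte[] serializeItem(String item) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(item);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Layout A: the list is one value, so each append reserializes all of it. */
    static long bytesWrittenWholeList(int appends, String item) {
        List<String> list = new ArrayList<>();
        long written = 0;
        for (int i = 0; i < appends; i++) {
            list.add(item);
            written += serializeList(list).length; // grows with the list size
        }
        return written;
    }

    /** Layout B: one entry per element, so each append writes only the new element. */
    static long bytesWrittenPerElement(int appends, String item) {
        long written = 0;
        for (int i = 0; i < appends; i++) {
            written += serializeItem(item).length; // constant per append
        }
        return written;
    }
}
```

In this model, layout A's write volume grows quadratically with the number of appends and layout B's grows linearly. Whether that matches what happened in this job depends on the state backend and serializers in use, which the thread does not mention.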

On Mon, Jun 14, 2021 at 5:47 AM Yun Gao <[hidden email]> wrote: