Hi. Are there good ways to debug long Flink checkpoint durations?
I'm running a backfill job that runs through ~10 days of data and then checkpoints start failing. Since I only see the last 10 checkpoints in the jobmaster UI, I can't see when the failures start. I looked through the text logs and didn't see much. I assume one of two things: 1) I have something misconfigured that is causing old state to stick around. 2) I don't have enough resources.
Hi Dan, I think you can see the details of the checkpoints via the checkpoint UI [1]. Also, if the pending checkpoints show that some tasks have not taken a snapshot, you might check whether that task is backpressuring the previous tasks [2]. Best, Yun
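To look past what the UI shows, one option is to pull the checkpoint statistics as JSON and scan them for the first failure and for growing durations. The following is a minimal sketch, not a definitive tool: it assumes a payload shaped like the response of Flink's REST endpoint `/jobs/<jobid>/checkpoints`, with a `history` list whose entries carry `id`, `status`, `end_to_end_duration` (ms) and `state_size` (bytes). Those field names should be checked against the Flink version in use, and the sample data below is made up for illustration. The number of checkpoints kept in that history is governed by the `web.checkpoints.history` setting (default 10), so raising it should make earlier checkpoints visible.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[int, str, int, int]  # (id, status, duration in ms, state size in bytes)


def summarize_checkpoints(stats: Dict[str, Any]) -> Tuple[Optional[int], List[Row]]:
    """Return (id of the first failed checkpoint or None, rows sorted by id).

    `stats` is assumed to look like the JSON from /jobs/<jobid>/checkpoints.
    """
    rows = sorted(
        (
            (c["id"], c["status"], c["end_to_end_duration"], c["state_size"])
            for c in stats.get("history", [])
        ),
        key=lambda row: row[0],
    )
    first_failed = next((row[0] for row in rows if row[1] == "FAILED"), None)
    return first_failed, rows


# Hypothetical sample payload, for illustration only.
sample = {
    "history": [
        {"id": 3, "status": "FAILED", "end_to_end_duration": 600000, "state_size": 0},
        {"id": 1, "status": "COMPLETED", "end_to_end_duration": 42000, "state_size": 1000000},
        {"id": 2, "status": "COMPLETED", "end_to_end_duration": 310000, "state_size": 9000000},
    ]
}

first_failed, rows = summarize_checkpoints(sample)
for cid, status, duration_ms, size in rows:
    print(f"checkpoint {cid}: {status}, {duration_ms / 1000:.0f}s, {size} bytes")
print("first failed checkpoint:", first_failed)
```

Steadily growing state size before the first failure would point toward state that is not being cleaned up; flat state size with growing duration would point more toward backpressure or insufficient resources.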
Thanks! Yes, I've looked at these. My job is facing backpressure starting at an early join step. I'm unclear whether the backfill just needs more time or whether I need more resources. On Tue, Mar 2, 2021 at 12:50 AM Yun Gao <[hidden email]> wrote:
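When reading backpressure, it can help to remember that a backpressured task is usually waiting on something downstream, so the likely bottleneck is the first operator after the backpressured ones that is not itself backpressured. Here is a rough sketch of that rule of thumb. It assumes a simple linear pipeline listed from source to sink and one backpressure ratio (0.0 to 1.0) per operator; the operator names, the ratios, and the 0.5 threshold are all illustrative assumptions, not values from this job.

```python
from typing import List, Optional, Tuple


def find_bottleneck(
    pipeline: List[Tuple[str, float]], threshold: float = 0.5
) -> Optional[str]:
    """Return the operator right after the last backpressured one, or None.

    `pipeline` lists (operator name, backpressure ratio) from source to sink.
    An operator counts as backpressured when its ratio exceeds `threshold`.
    """
    last_backpressured = None
    for index, (_, ratio) in enumerate(pipeline):
        if ratio > threshold:
            last_backpressured = index
    if last_backpressured is None or last_backpressured + 1 >= len(pipeline):
        return None  # nothing is backpressured, or no downstream operator exists
    return pipeline[last_backpressured + 1][0]


# Hypothetical ratios, for illustration only.
example = [("source", 0.9), ("join", 0.8), ("aggregate", 0.05), ("sink", 0.0)]
print(find_bottleneck(example))
```

In this made-up example the join is backpressured but the aggregate after it is not, which suggests the aggregate is the slow step and the place to look at parallelism or state access first.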
I dove deeper into it and made a little more progress (by giving the job more resources). Here is a screenshot of one bottleneck: https://drive.google.com/file/d/1slLO6PJVhXfoAN5OrSqsE9G7kvHPXJnl/view?usp=sharing My job isn't making any progress. It's checkpointing and failing. The TaskManager text logs are empty during the checkpoint, so it's not clear whether the checkpoint itself is making any progress. I spent some time changing the memory parameters, but it's unclear if I'm making forward progress. On Tue, Mar 2, 2021 at 3:45 PM Dan Hill <[hidden email]> wrote:
The checkpoint was only acknowledged shortly after it was started. On Thu, Mar 4, 2021 at 12:38 PM Dan Hill <[hidden email]> wrote: