Debugging long Flink checkpoint durations

Dan Hill
Hi.  Are there good ways to debug long Flink checkpoint durations?

I'm running a backfill job that processes ~10 days of data, and at some point its checkpoints start failing.  Since the jobmaster UI only shows the last 10 checkpoints, I can't see when the problem starts.

I looked through the text logs and didn't see much.

I assume either:
1) I have something misconfigured that is causing old state to stick around.
2) I don't have enough resources.
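
For context, it looks like the flink-conf.yaml option web.checkpoints.history controls how many checkpoints the web UI keeps (the default is 10), so raising it should let me see further back.  Below is a minimal sketch of the checkpoint knobs involved, with placeholder values rather than my actual configuration:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder values, not my real settings.
        env.enableCheckpointing(60_000);                // trigger a checkpoint every 60 s
        CheckpointConfig cfg = env.getCheckpointConfig();
        cfg.setCheckpointTimeout(30 * 60 * 1000L);      // allow a slow checkpoint up to 30 min before it expires
        cfg.setMinPauseBetweenCheckpoints(60_000);      // leave the job time to make progress between checkpoints
        cfg.setTolerableCheckpointFailureNumber(3);     // don't fail the job on the first expired checkpoint

        // ... the rest of the backfill pipeline would be built and executed here ...
    }
}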
 

Re: Debugging long Flink checkpoint durations

Yun Gao (Tue, Mar 2, 2021, 12:50 AM)
Hi Dan,

I think you could see the details of the checkpoints via the checkpoint UI [1]. Also, if you see in a pending checkpoint that some tasks have not taken their snapshot yet, you might have a look at whether those tasks are backpressuring the upstream tasks [2].
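
If the UI is hard to dig through, the same statistics are exposed by the JobManager's REST API.  A rough sketch, assuming the REST endpoint is reachable at localhost:8081 and taking the job id as an argument:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointStats {
    public static void main(String[] args) throws Exception {
        String jobId = args[0];                    // the job id shown in the web UI
        String base = "http://localhost:8081";     // assumed JobManager REST address

        HttpClient client = HttpClient.newHttpClient();

        // Checkpoint statistics for the job: counts, durations, and recent history.
        HttpRequest checkpoints = HttpRequest
                .newBuilder(URI.create(base + "/jobs/" + jobId + "/checkpoints"))
                .build();
        System.out.println(client.send(checkpoints, HttpResponse.BodyHandlers.ofString()).body());

        // Backpressure for a single task is available under
        // /jobs/<job-id>/vertices/<vertex-id>/backpressure
    }
}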

Best,
Yun




Re: Debugging long Flink checkpoint durations

Dan Hill (Tue, Mar 2, 2021, 3:45 PM)
Thanks!  Yes, I've looked at these.  My job is hitting backpressure starting at an early join step.  I'm unclear whether the backfill is fine and just needs more time, or whether I need more resources.
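
If it does turn out to be my assumption #1 (old state from the join sticking around), one option would be to put a TTL on the join's keyed state so it cannot grow without bound over the whole backfill.  A rough sketch of what I mean, with illustrative names rather than my actual state descriptors:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class JoinStateTtl {
    // Illustrative only: the TTL would be applied to whatever descriptor the join function really uses.
    static ValueStateDescriptor<Long> joinBufferDescriptor() {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.days(11))      // slightly longer than the ~10-day backfill window
                .cleanupFullSnapshot()          // drop expired entries when a full snapshot is taken
                .build();

        ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("joinBuffer", Long.class);
        desc.enableTimeToLive(ttl);
        return desc;
    }
}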


Re: Debugging long Flink checkpoint durations

Dan Hill (Thu, Mar 4, 2021, 12:38 PM)
I dove deeper into it and made a little more progress (by giving the job more resources).

Here is a screenshot of one bottleneck:
https://drive.google.com/file/d/1slLO6PJVhXfoAN5OrSqsE9G7kvHPXJnl/view?usp=sharing

My job isn't making any progress.  It keeps checkpointing and failing.  The TaskManager text logs are empty during the checkpoint, and it's not clear whether the checkpoint is making any progress.

I spent some time changing the memory parameters, but it's unclear if I'm making forward progress.

Re: Debugging long Flink checkpoint durations

Dan Hill
The checkpoint only received acknowledgements shortly after it was started.
