Debugging long Flink checkpoint durations

Dan Hill
Hi.  Are there good ways to debug long Flink checkpoint durations?

I'm running a backfill job that processes ~10 days of data, and at some point its checkpoints start failing.  Since the jobmaster UI only shows the last 10 checkpoints, I can't see when the problem starts.

I looked through the text logs and didn't see much.

I assume either:
1) I have something misconfigured that is causing old state to stick around.
2) I don't have enough resources.
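
For context, it looks like the flink-conf.yaml option web.checkpoints.history controls how many checkpoints the web UI keeps (the default is 10), so raising it should let me see further back.  Below is a minimal sketch of the checkpoint knobs involved, with placeholder values rather than my actual configuration:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder values, not my real settings.
        env.enableCheckpointing(60_000);                // trigger a checkpoint every 60 s
        CheckpointConfig cfg = env.getCheckpointConfig();
        cfg.setCheckpointTimeout(30 * 60 * 1000L);      // allow a slow checkpoint up to 30 min before it expires
        cfg.setMinPauseBetweenCheckpoints(60_000);      // leave the job time to make progress between checkpoints
        cfg.setTolerableCheckpointFailureNumber(3);     // don't fail the job on the first expired checkpoint

        // ... the rest of the backfill pipeline would be built and executed here ...
    }
}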
 

Re: Debugging long Flink checkpoint durations

Yun Gao (Tue, Mar 2, 2021, 12:50 AM)
Hi Dan,

I think you could see the details of the checkpoints via the checkpoint UI [1]. Also, if you see in a pending checkpoint that some tasks have not taken their snapshot yet, you might have a look at whether those tasks are backpressuring the upstream tasks [2].
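
If the UI is hard to dig through, the same statistics are exposed by the JobManager's REST API.  A rough sketch, assuming the REST endpoint is reachable at localhost:8081 and taking the job id as an argument:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointStats {
    public static void main(String[] args) throws Exception {
        String jobId = args[0];                    // the job id shown in the web UI
        String base = "http://localhost:8081";     // assumed JobManager REST address

        HttpClient client = HttpClient.newHttpClient();

        // Checkpoint statistics for the job: counts, durations, and recent history.
        HttpRequest checkpoints = HttpRequest
                .newBuilder(URI.create(base + "/jobs/" + jobId + "/checkpoints"))
                .build();
        System.out.println(client.send(checkpoints, HttpResponse.BodyHandlers.ofString()).body());

        // Backpressure for a single task is available under
        // /jobs/<job-id>/vertices/<vertex-id>/backpressure
    }
}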

Best,
Yun




Re: Debugging long Flink checkpoint durations

Dan Hill (Tue, Mar 2, 2021, 3:45 PM)
Thanks!  Yes, I've looked at these.  My job is hitting backpressure starting at an early join step.  I'm unclear whether the backfill is fine and just needs more time, or whether I need more resources.
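
If it does turn out to be my assumption #1 (old state from the join sticking around), one option would be to put a TTL on the join's keyed state so it cannot grow without bound over the whole backfill.  A rough sketch of what I mean, with illustrative names rather than my actual state descriptors:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class JoinStateTtl {
    // Illustrative only: the TTL would be applied to whatever descriptor the join function really uses.
    static ValueStateDescriptor<Long> joinBufferDescriptor() {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.days(11))      // slightly longer than the ~10-day backfill window
                .cleanupFullSnapshot()          // drop expired entries when a full snapshot is taken
                .build();

        ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("joinBuffer", Long.class);
        desc.enableTimeToLive(ttl);
        return desc;
    }
}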


Re: Debugging long Flink checkpoint durations

Dan Hill (Thu, Mar 4, 2021, 12:38 PM)
I dove deeper into it and made a little more progress (by giving the job more resources).

Here is a screenshot of one bottleneck:
https://drive.google.com/file/d/1slLO6PJVhXfoAN5OrSqsE9G7kvHPXJnl/view?usp=sharing

My job isn't making any progress.  It keeps checkpointing and failing.  The TaskManager text logs are empty during the checkpoint, and it's not clear whether the checkpoint is making any progress.

I spent some time changing the memory parameters, but it's unclear if I'm making forward progress.

Re: Debugging long Flink checkpoint durations

Dan Hill
The checkpoint only received acknowledgements shortly after it was started.
