We have a topology and the checkpoints fail to complete a *lot* of the time. Typically it is just one subtask that fails.

We have a parallelism of 2 on this topology at present; the other subtask will complete in 3ms, though the end-to-end duration on the rare occasions when the checkpointing does complete is something like 4m30s.

How can I start debugging this? When I run locally on my development cluster I have no issues; the issues only seem to show in production.
Hi,

In my experience there are several possible reasons for checkpoint failures:

1. If you use RocksDB as the state backend, insufficient local disk space can cause it, because the working state files are kept on local disk; in that case you should see an exception.
2. The sink cannot be written to. Then none of the parallel subtasks can complete, and there is often no obvious symptom.
3. Back pressure. Data skew can cause one subtask to take on much more of the computation, so its checkpoint cannot finish.

My advice:

1. Learn more about how checkpointing works.
2. Check for back pressure.
3. If there is no data skew, you can increase the parallelism, or adjust the checkpoint parameters (a rough sketch of those settings follows below).

In my environment I have Hadoop, so I submit jobs to YARN and can use the dashboard to check back pressure.

On 2020/03/23 15:14:33, Stephen Connolly <s...@gmail.com> wrote:
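To make the "adjust checkpoint parameters" part concrete, here is a rough sketch of the relevant settings with the DataStream API. It is not a drop-in config for your job: the interval, timeout, and checkpoint path are placeholders to adapt, and the RocksDB backend needs the flink-statebackend-rocksdb dependency on the classpath.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTuningSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Trigger a checkpoint every 60s (placeholder value).
            env.enableCheckpointing(60_000L);

            CheckpointConfig conf = env.getCheckpointConfig();
            // Give slow subtasks more time before the checkpoint is declared expired.
            conf.setCheckpointTimeout(10 * 60 * 1000L);
            // Leave the job room to catch up between checkpoints when it is back pressured.
            conf.setMinPauseBetweenCheckpoints(30_000L);
            conf.setMaxConcurrentCheckpoints(1);

            // RocksDB keeps working state on local disk and checkpoints to a durable
            // filesystem; both the local disks and this path need enough free space.
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

            // Trivial pipeline just so the sketch runs; replace with the real topology.
            env.fromElements(1, 2, 3).print();
            env.execute("checkpoint-tuning-sketch");
        }
    }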
seeksst has already covered many of the relevant points, but a few more thoughts:

I would start by checking whether the checkpoints are failing because they time out, or for some other reason. Assuming they are timing out, a good place to start is to look and see whether this can be explained by data skew (which you can see in the metrics in the Flink dashboard). Common causes of data skew include hot key(s), and joins between streams where one stream is significantly behind the other.

Another likely cause of checkpoint trouble is back pressure, which is most often caused by slow or unavailable connections between Flink and external systems, such as sinks, async i/o operators, filesystems, the network, etc. The per-checkpoint details in the web UI (or via the REST API, as in the sketch below) will show you which subtasks are slow.

On Tue, Mar 24, 2020 at 2:59 AM seeksst <[hidden email]> wrote:
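For completeness, the same checkpoint statistics the dashboard shows can be pulled from the JobManager's REST API. A minimal sketch, assuming the REST endpoint is reachable on localhost:8081; the host, port and <job-id> are placeholders for your setup:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CheckpointStatsProbe {
        public static void main(String[] args) throws Exception {
            // GET /jobs/:jobid/checkpoints returns counts and per-checkpoint summaries;
            // /jobs/:jobid/checkpoints/details/:checkpointid drills down to subtasks.
            String jobId = args.length > 0 ? args[0] : "<job-id>"; // placeholder
            URL url = new URL("http://localhost:8081/jobs/" + jobId + "/checkpoints");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
                // Inspect the failure counts and the latest failed checkpoint's failure
                // message to see whether the failures are timeouts or declines.
                System.out.println(body);
            } finally {
                conn.disconnect();
            }
        }
    }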
Hi,

From my experience, you can first check the jobmanager.log and find out whether the checkpoint expired or was declined by some task. If it expired, you can follow the advice seeksst gave above (enabling debug logging may help you here; a rough sketch is below). If it was declined, then you can go to the taskmanager.log to find out the reason.

Best,
Congxian

David Anderson <[hidden email]> wrote on Wed, Mar 25, 2020 at 11:21 PM:
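In case it helps, this is roughly what turning on debug logging for the checkpoint machinery looks like. It is only a sketch: the syntax below is for the log4j 1.x log4j.properties that Flink 1.10 ships with; if your distribution uses log4j 2 the equivalent goes into its own logger.*.name / logger.*.level entries.

    # In conf/log4j.properties: debug logging for the checkpoint coordinator package
    # only, so the rest of the framework does not flood the logs.
    log4j.logger.org.apache.flink.runtime.checkpoint=DEBUG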