Re: Checkpoint fail due to timeout

Posted by Alexey Trenikhun on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Checkpoint-fail-due-to-timeout-tp42125p42216.html

Hello Roman,

  • History, details, and summary stats are attached.
  • There is backpressure on all sources except Source:gca-cfg and Source:heartbeat.
  • Flink version is 1.12.1; I also tried 1.12.2 with the same results.
Thanks,
Alexey

From: Roman Khachatryan <[hidden email]>
Sent: Thursday, March 11, 2021 11:49 PM
To: Alexey Trenikhun <[hidden email]>
Cc: Flink User Mail List <[hidden email]>
Subject: Re: Checkpoint fail due to timeout
 
Hello,

This can have several causes, such as back-pressure, large
snapshots, or bugs.

Could you please share:
- the stats of the previous (successful) checkpoints
- back-pressure metrics for sources
- the Flink version you use

Regards,
Roman


On Thu, Mar 11, 2021 at 7:03 AM Alexey Trenikhun <[hidden email]> wrote:
>
> Hello,
> We are experiencing a problem with checkpoints failing due to timeout (the timeout is already set to 30 minutes, and they still fail). The checkpoints were not too big before they started to fail, around 1.2 GB. It looks like one of the sources (Kafka) never acknowledged (see attached screenshot). What could be the reason?
>
> Thanks,
> Alexey
>
>
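
For readers who want to check which task never acknowledged without relying on a screenshot, here is a minimal Python sketch. It assumes you have already fetched the checkpoint details JSON from the Flink REST API (the per-checkpoint details endpoint) and that each entry under "tasks" carries "num_subtasks" and "num_acknowledged_subtasks"; please verify those field names against your Flink version. The function name pending_acks and the vertex IDs in the sample are made up for illustration.

```python
from typing import List, Tuple


def pending_acks(checkpoint_details: dict) -> List[Tuple[str, int, int]]:
    """Return (vertex_id, acknowledged, total) for every task whose
    subtasks have not all acknowledged the checkpoint."""
    pending = []
    for vertex_id, task in checkpoint_details.get("tasks", {}).items():
        total = task["num_subtasks"]
        acked = task["num_acknowledged_subtasks"]
        if acked < total:
            pending.append((vertex_id, acked, total))
    # Lowest acknowledged fraction first, so the likeliest culprit is on top.
    return sorted(pending, key=lambda row: row[1] / row[2])


# Hypothetical sample shaped like a checkpoint-details response;
# real vertex IDs are hex strings, these are placeholders.
sample = {
    "id": 42,
    "status": "IN_PROGRESS",
    "tasks": {
        "vertex-kafka-source": {"num_subtasks": 4, "num_acknowledged_subtasks": 0},
        "vertex-heartbeat": {"num_subtasks": 1, "num_acknowledged_subtasks": 1},
        "vertex-sink": {"num_subtasks": 4, "num_acknowledged_subtasks": 3},
    },
}

for vertex_id, acked, total in pending_acks(sample):
    print(f"{vertex_id}: {acked}/{total} subtasks acknowledged")
```

A task that stays at 0 acknowledged subtasks while downstream tasks make progress would match the symptom described above, where one Kafka source never acknowledges.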

Screen Shot 2021-03-12 at 11.35.22 AM.png (100K)
Screen Shot 2021-03-12 at 11.37.48 AM.png (218K)
Screen Shot 2021-03-12 at 11.39.39 AM.png (654K)