Re: Checkpoint fail due to timeout

Posted by Alexey Trenikhun on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Checkpoint-fail-due-to-timeout-tp42125p42444.html

Great! I doubt that it will help in my case, however, since for me even unaligned checkpoints get "stuck". Unlike with aligned checkpoints, at some point after an unaligned checkpoint is triggered Flink becomes idle: Kubernetes metrics report very little CPU usage by the container, but the unaligned checkpoint still times out after 3 hours.


From: Arvid Heise <[hidden email]>
Sent: Monday, March 22, 2021 6:58:20 AM
To: ChangZhuo Chen (陳昌倬) <[hidden email]>
Cc: Alexey Trenikhun <[hidden email]>; [hidden email] <[hidden email]>; Flink User Mail List <[hidden email]>
Subject: Re: Checkpoint fail due to timeout
 
Hi Alexey,

rescaling from unaligned checkpoints will be supported with the upcoming 1.13 release (expected at the end of April).

Best,

Arvid

On Wed, Mar 17, 2021 at 8:29 AM ChangZhuo Chen (陳昌倬) <[hidden email]> wrote:
On Wed, Mar 17, 2021 at 05:45:38AM +0000, Alexey Trenikhun wrote:
> In my opinion it looks similar. Were you able to tune Flink to make it work? I'm stuck with it: I wanted to scale up, hoping to reduce backpressure, but to rescale I need to take a savepoint, which never completes (or at least takes longer than 3 hours).

You can use an aligned checkpoint to rescale your job. Just restarting from
the checkpoint with the same jar file and the new parallelism should do the
trick.
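
A minimal sketch of such a restart with the Flink CLI, assuming the job keeps
its checkpoints after cancellation (retained/externalized checkpoints) and that
the last completed one was an aligned checkpoint. The checkpoint path,
parallelism, and jar name below are hypothetical placeholders, not values from
this thread.

```shell
#!/bin/sh
# All three values are hypothetical; substitute your own.
CHECKPOINT_PATH="s3://my-bucket/checkpoints/my-job-id/chk-42"  # retained checkpoint directory
NEW_PARALLELISM=8                                              # parallelism to rescale to
JOB_JAR="my-job.jar"                                           # same jar the job ran with

# -s restores state from the given checkpoint (or savepoint) path,
# -p sets the default parallelism for the resubmitted job.
CMD="flink run -s $CHECKPOINT_PATH -p $NEW_PARALLELISM $JOB_JAR"

# Dry run: only print the command so it can be reviewed before submitting.
echo "$CMD"
```

Once the printed command looks right, run it from the Flink distribution's
bin directory (or wherever the flink script is on your PATH).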


--
ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
http://czchen.info/
Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B