Hi all, I am trying to test failure recovery of a Flink job when a JM or TM goes down. Our target is having job auto restart and back to normal condition in any case. However, what's I am seeing is very strange and hope someone here help me to understand it. When JM or TM went down, I see the job was being restarted but as soon as it restart it's working on checkingpoint and usually took 30+ minutes to finish (usually in normal case, it only take 1-2 mins for checkpoint), As soon as the checkpoint is finish, the job is back to normal condition. I'm using 1.4.2, but seeing similar thing on 1.6.0 as well. Could anyone please help to explain this behavior? We really want to reduce the time of recovery but doesn't seem to find any document mentioned about recovery process in detail. Any help is really appreciate. Thanks Kien Thanks
Kien |
Hi Kien
From your description, your job has already started to execute checkpoint after job failover, which means your job was in RUNNING status. From my point of view, the actual recovery time should be the time during job's status: RESTARTING->CREATED->RUNNING[1].
Your trouble sounds more like the long time needed for the first checkpoint to complete after job failover. Afaik, It's probably because your job is heavily back pressured after the failover and the checkpoint mode is exactly-once, some operators need to receive
all the input checkpoint barrier to trigger the checkpoint. You can watch your metrics of checkpoint alignment time to verify the root cause, and if you do not need the exactly once guarantees, you can change the checkpoint mode to at-least-once[2].
Best
Yun Tang
From: trung kien <[hidden email]>
Sent: Thursday, September 6, 2018 18:50 To: [hidden email] Subject: Flink failure recovery tooks very long time Hi all,
I am trying to test failure recovery of a Flink job when a JM or TM goes down.
Our target is having job auto restart and back to normal condition in any case.
However, what's I am seeing is very strange and hope someone here help me to understand it.
When JM or TM went down, I see the job was being restarted but as soon as it restart it's working on checkingpoint and usually took 30+ minutes to finish (usually in normal case, it only take 1-2 mins for checkpoint), As soon as the checkpoint is finish,
the job is back to normal condition.
I'm using 1.4.2, but seeing similar thing on 1.6.0 as well.
Could anyone please help to explain this behavior? We really want to reduce the time of recovery but doesn't seem to find any document mentioned about recovery process in detail.
Any help is really appreciate.
Thanks
Kien Thanks
Kien |
Hi trung, Can you provide more information to aid in positioning? For example, the size of the state generated by a checkpoint and more log information, you can try to switch the log level to DEBUG. Thanks, vino. Yun Tang <[hidden email]> 于2018年9月6日周四 下午7:42写道:
|
Hi Yun, Yes, the job’s status change to Running pretty fast after failure (~ 1 min). As soon as the status change to running, first checkpoint is kick off and it took 30 mins. I need to have exactly-one as i maintining some aggregation metric, do you know whats the diffrent between first checkpoint and checkpoints after that? (it’s fairely quick after that) Here is size of my checkpoints ( i config to keep 5 latest checkpoints) 449M chk-1626 775M chk-1627 486M chk-1628 7.8G chk-1629 7.5G chk-1630 I dont know why the size is too diffrent. Metrics on checkpoints looks good as besides the spike in the first checkpoint, everything looks fine. @Vino: Yes, i can try to switch to DEBuG to see if i got any information. On Thu, Sep 6, 2018 at 7:09 AM vino yang <[hidden email]> wrote:
Thanks
Kien |
And here is the snapshot of my checkpoint metrics in normal condition. On Thu, Sep 6, 2018 at 9:21 AM trung kien <[hidden email]> wrote:
Thanks
Kien |
Hi Kien
You could try to kill one TM container by using 'yarn container -signal <container id> FORCEFUL_SHUTDOWN' command, and then watch the first checkpoint after job failover. You could view the checkpoint details[1] to see whether exists outlier operator or sub-task
which consumed extremely long time to finish the checkpoint, and then you could view the task manager log of that sub-task to see anything weird. If all the sync and async duration is not long but the end-to-end duration is too long, it might because the checkpoint
alignment time is too long which affects the overall end-to-end duration.
Best
Yun Tang
From: trung kien <[hidden email]>
Sent: Thursday, September 6, 2018 22:31 To: vino yang Cc: [hidden email]; user Subject: Re: Flink failure recovery tooks very long time And here is the snapshot of my checkpoint metrics in normal condition.
On Thu, Sep 6, 2018 at 9:21 AM trung kien <[hidden email]> wrote:
Thanks
Kien |
Hi Kien, Yun Tang's analytical thinking is correct. It's not clear why there is such a big change in the size of your checkpoint state. When recovering, it's obviously based on the latest checkpoint. The running state of the job is actually just a reference, because it won't wait until all the task instances become running. Thanks, vino. Yun Tang <[hidden email]> 于2018年9月6日周四 下午10:53写道:
image_6483441.JPG (32K) Download Attachment image_6483441.JPG (73K) Download Attachment image_6483441.JPG (32K) Download Attachment |
Free forum by Nabble | Edit this page |