Small checkpoint data takes too much time

徐涛
Hi,
I recently encountered a problem in production: checkpoints take too long, although this doesn't affect job execution.
I am using FsStateBackend, writing the data to an HDFS checkpointDataUri, with asynchronousSnapshots. I print the metrics “lastCheckpointDuration” and “lastCheckpointSize”. “lastCheckpointSize” is about 80 KB, but “lastCheckpointDuration” is about 160 s! Because the checkpoint data is small, I don't think it should take that long. I don't know why this happens or which conditions influence the checkpoint time. Has anyone encountered this problem?
Thanks a lot.

Best
Henry

Re: Small checkpoint data takes too much time

Zhijiang(wangzhijiang999)
The checkpoint duration includes both barrier alignment and the state snapshot. Every task has to receive the barriers from all of its input channels before it triggers the state snapshot.
I suspect barrier alignment is taking a long time in your case; this is especially significant under backpressure. You can check the "checkpointAlignmentTime" metric to confirm.

Best,
Zhijiang
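
To make the alignment step described above concrete, here is a minimal Python sketch. It is not Flink code. It models a single task with several input channels: the task can snapshot only once the barrier has arrived on every channel, and the alignment time is taken as the gap between the first and the last barrier arrival. The function name `alignment_time_ms` and the arrival times are hypothetical, chosen only for illustration.

```python
def alignment_time_ms(barrier_arrivals_ms):
    """Return (snapshot_start_ms, alignment_ms) for one task.

    barrier_arrivals_ms maps a channel name to the time (ms since the
    checkpoint was triggered) at which the barrier arrived on that channel.
    """
    if not barrier_arrivals_ms:
        raise ValueError("a task needs at least one input channel")
    first = min(barrier_arrivals_ms.values())
    last = max(barrier_arrivals_ms.values())
    # The snapshot can start only after the last barrier has arrived.
    return last, last - first


# Hypothetical arrivals: channel "c" is back-pressured and delivers late.
arrivals = {"a": 120, "b": 150, "c": 4120}
snapshot_start, alignment = alignment_time_ms(arrivals)
print(snapshot_start, alignment)
```

In this toy model a single slow channel determines when the snapshot can start, regardless of how small the state is.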
------------------------------------------------------------------
From: 徐涛 <[hidden email]>
Sent: Wednesday, October 10, 2018, 13:13
To: user <[hidden email]>
Subject: Small checkpoint data takes too much time

Re: Small checkpoint data takes too much time

徐涛
Hi Zhijiang,
Thanks for your response.
I added the checkpointAlignmentTime metric. The data shows that the checkpointDuration is about 150 s, while the checkpointAlignmentTime is only about 4 s. There is a big gap between them.

Best
Henry
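
The gap reported above can be quantified with a few lines of Python. This is only arithmetic on the two figures from this message (about 150 s and about 4 s); `unexplained_share` is a hypothetical helper, not a Flink API. The remainder covers everything other than alignment, which this thread does not break down further.

```python
def unexplained_share(duration_s, alignment_s):
    """Return (remainder_s, fraction) of checkpoint time not spent aligning."""
    if duration_s <= 0 or not 0 <= alignment_s <= duration_s:
        raise ValueError("expected duration_s > 0 and 0 <= alignment_s <= duration_s")
    remainder = duration_s - alignment_s
    return remainder, remainder / duration_s


# Figures taken from the message above: ~150 s duration, ~4 s alignment.
remainder, fraction = unexplained_share(150, 4)
print(f"{remainder} s ({fraction:.1%}) is not explained by alignment")
```

With these numbers, alignment accounts for only a small part of the checkpoint duration, so the cause has to be looked for elsewhere.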

On Oct 10, 2018, at 1:26 PM, Zhijiang(wangzhijiang999) <[hidden email]> wrote: