Checkpoint acknowledge takes too long

徐涛
Hi
        I am running a Flink application with parallelism 64. I left the checkpoint timeout at its default value of 10 minutes, the state size is less than 1 MB, and I am using the FsStateBackend.
        The application triggers checkpoints, but all of them fail with "Checkpoint expired before completing". In the checkpoint history I see that 63 subtasks acknowledged but one shows n/a, and the alignment duration is quite long, about 5m27s.
        Why does one subtask not acknowledge? And since the alignment duration is so long, what influences the alignment duration?
        Thanks a lot.

Best
Henry

Re: Checkpoint acknowledge takes too long

Kien Truong
Hi,

In my experience, this is most likely because one sub-task is blocked
in some long-running operation.

Try running the task manager with a profiler (such as VisualVM) and check
for hot spots.


Regards,

Kien

Re: Checkpoint acknowledge takes too long

Hequn Cheng
In reply to this post by 徐涛
Hi Henry,

@Kien is right. Take a thread dump to see what the TaskManager is doing. Also check whether GC is happening frequently.
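
For a running TaskManager the usual route is the JDK command-line tools, e.g. `jstack <pid>` for a thread dump and `jstat -gcutil <pid> 1000` for GC activity. As a rough sketch of the same information obtained from inside a JVM, the snippet below uses only the standard `java.lang.management` API. The class and method names (`JvmSnapshot`, `threadDump`, `totalGcMillis`) are made up for illustration and are not part of Flink.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not a Flink class: inspects the JVM it runs in.
final class JvmSnapshot {

    /** Returns a textual dump (name, state, stack) of all live threads in this JVM. */
    public static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and synchronizer details.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    /** Returns the accumulated GC time in milliseconds since JVM start. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.print(threadDump());
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC time: %d ms of %d ms uptime%n", totalGcMillis(), uptime);
    }
}
```

In the dump, look for a task thread that sits in the same frames across several dumps taken a few seconds apart. A GC time that is a large fraction of the uptime suggests the pauses themselves are the problem.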

Best, Hequn
 

Re: Checkpoint acknowledge takes too long

徐涛
Hi Hequn & Kien,
The problem is finally solved.
It was caused by slow sink writes. The job has only 2 tasks, so I checked the backpressure and found that the source had high backpressure, which pointed at the sink. I then improved the sink write performance. After that the end-to-end checkpoint duration is below 1s and the checkpoint timeouts are gone.
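
To illustrate why a slow sink shows up as a checkpoint timeout: with aligned checkpoints, a barrier travels in-band with the records and cannot overtake the data already buffered in front of it. The toy model below is only a sketch of that effect, not Flink code. The class name, the method name and all numbers are hypothetical and are not measurements from this job.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model (hypothetical, not Flink code): one channel between source and sink.
final class BarrierDelayModel {

    private static final int BARRIER = -1;

    /**
     * Returns the simulated time in microseconds until the sink sees the barrier,
     * given the number of records buffered ahead of it and the sink's
     * per-record write time.
     */
    public static long barrierDelayMicros(int queuedRecords, long sinkMicrosPerRecord) {
        Deque<Integer> channel = new ArrayDeque<>();
        for (int i = 0; i < queuedRecords; i++) {
            channel.add(i);          // records already buffered due to backpressure
        }
        channel.add(BARRIER);        // the barrier is enqueued behind them

        long clockMicros = 0;
        while (true) {
            int next = channel.poll();
            if (next == BARRIER) {
                return clockMicros;  // the sink can only snapshot now
            }
            clockMicros += sinkMicrosPerRecord; // the sink writes one record
        }
    }

    public static void main(String[] args) {
        // Made-up numbers: 10,000 buffered records, 30 ms vs 50 us per write.
        long slow = barrierDelayMicros(10_000, 30_000);
        long fast = barrierDelayMicros(10_000, 50);
        System.out.printf("slow sink: %.1f s, fast sink: %.1f s%n", slow / 1e6, fast / 1e6);
    }
}
```

In this model the barrier delay is simply the backlog multiplied by the per-record write time, so speeding up the sink shortens the end-to-end checkpoint duration directly.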

Best
Henry


On Oct 24, 2018, at 10:43 PM, 徐涛 <[hidden email]> wrote:

Hequn & Kien,
Thanks a lot for your help. I will try it later.

Best
Henry


Re: Checkpoint acknowledge takes too long

Hequn Cheng
Hi Henry, 

Thanks for letting us know. 
