apache flink: Why checkpoint coordinator takes long time to get completion


apache flink: Why checkpoint coordinator takes long time to get completion

Xiangyu Su
Dear Flink community,

We are running a POC with Flink (1.8) to process data in real time, using global checkpointing (S3) and local checkpointing (EBS), with the cluster deployed on EKS. Our application consumes data from Kinesis.

In my test, for example, I am using a checkpointing interval of 5 min and a minimum pause of 2 min.

The issue we saw is this: the Flink checkpointing process seems to be idle for 3-4 min before the job manager gets the completion notification.

Here is some logging from the job manager:

2019-07-10 11:59:03,893 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 4 @ 1562759941082 for job e7a97014f5799458f1c656135712813d.
2019-07-10 12:05:01,836 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 4 for job e7a97014f5799458f1c656135712813d (22387207650 bytes in 58645 ms).

As I understand the logging above, the completedCheckpoint (CheckpointCoordinator) object was completed in 58645 ms, but the whole checkpointing process took ~6 min.
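
To make the discrepancy concrete, here is a minimal sketch that compares the wall-clock time between the two log lines with the duration reported in the "Completed checkpoint" line. It assumes both log timestamps come from the same JobManager clock; the helper names (`log_time`, `reported_duration_ms`) are hypothetical and not part of Flink.

```python
import re
from datetime import datetime, timedelta

# The two JobManager log lines quoted above, copied verbatim.
TRIGGER_LINE = (
    "2019-07-10 11:59:03,893 INFO  "
    "org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - "
    "Triggering checkpoint 4 @ 1562759941082 for job "
    "e7a97014f5799458f1c656135712813d."
)
COMPLETE_LINE = (
    "2019-07-10 12:05:01,836 INFO  "
    "org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - "
    "Completed checkpoint 4 for job e7a97014f5799458f1c656135712813d "
    "(22387207650 bytes in 58645 ms)."
)


def log_time(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


def reported_duration_ms(line: str) -> int:
    """Extract the 'in N ms' duration from a 'Completed checkpoint' line."""
    match = re.search(r"in (\d+) ms", line)
    if match is None:
        raise ValueError("no duration found in log line")
    return int(match.group(1))


# Wall-clock time between the two log lines, in whole milliseconds.
wall_clock_ms = (log_time(COMPLETE_LINE) - log_time(TRIGGER_LINE)) // timedelta(
    milliseconds=1
)
reported_ms = reported_duration_ms(COMPLETE_LINE)

# Time that the reported checkpoint duration does not account for.
unaccounted_ms = wall_clock_ms - reported_ms

print(f"wall clock between log lines: {wall_clock_ms} ms")
print(f"reported checkpoint duration: {reported_ms} ms")
print(f"unaccounted gap:              {unaccounted_ms} ms")
```

The difference between the two values is the time that the reported duration does not cover.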

This logging is for the 4th checkpoint, but the first 3 checkpoints finished on time.

Could you please tell me why Flink checkpointing in my test started to be "idle" for a few minutes after 3 checkpoints?

Best Regards
--
Xiangyu Su
Java Developer
[hidden email]

Smaato Inc.
San Francisco - New York - Hamburg - Singapore
www.smaato.com

Germany:
Valentinskamp 70, Emporio, 19th Floor
20355 Hamburg
M 0049(176)22943076

The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited. If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.

Re: apache flink: Why checkpoint coordinator takes long time to get completion

Biao Liu
Hi Xiangyu,

I just took a quick look at the relevant code. There is a gap between calculating the duration and logging it. I guess checkpoint 4 finished in about 1 minute, but some unexpected time-consuming operation ran in that gap. I can't tell which part it is, though.


Re: apache flink: Why checkpoint coordinator takes long time to get completion

Xiangyu Su
OK, thanks.

So far this delay has always appeared after the 3rd checkpoint, and its length has been consistent (~4 min at under 4G/min of incoming traffic).


Re: apache flink: Why checkpoint coordinator takes long time to get completion

Xiangyu Su
Btw, it seems this issue has been fixed in 1.8.1.


Re: apache flink: Why checkpoint coordinator takes long time to get completion

tison
Hi Xiangyu,

Could you share the corresponding JIRA that fixed this issue?

Best,
tison.



Re: apache flink: Why checkpoint coordinator takes long time to get completion

Xiangyu Su
Hi Zili,

I could not find any ticket related to the "unexpected time-consuming" operation, though. I have just tested our application with both versions: the issue is reproducible every time with version 1.8.0, and it has not happened with version 1.8.1 so far.

Best regards
Xiangyu

On Tue, 23 Jul 2019 at 08:49, Zili Chen <[hidden email]> wrote:
Hi Xiangyu,

Could you share the corresponding JIRA that fixed this issue?

Best,
tison.

