Hey guys, On our pipeline, we have a single slot that is taking much longer to create its checkpoint than the other slots, and we are wondering what could be causing it. The operator in question is the window metric -- the only element in the pipeline that actually uses state. While the other slots take 7 minutes to create the checkpoint, this one -- and only this one -- takes 55 minutes. Is there something I should look at to understand what's going on? (We are storing all checkpoints in HDFS, in case that helps.)
-- Julio Biason, Software Engineer AZION | Deliver. Accelerate. Protect. Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
(Attachment: Screenshot from 2018-09-14 19-56-45.png, 230K)
(Just an addendum: although it's not a huge problem -- we can always increase the checkpoint timeout -- this anomalous situation makes me think there is something wrong in our pipeline or in our cluster, and that is what is making checkpoint creation go crazy.) On Fri, Sep 14, 2018 at 8:00 PM, Julio Biason <[hidden email]> wrote:
-- Julio Biason, Software Engineer AZION | Deliver. Accelerate. Protect. Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
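For reference, the checkpoint timeout mentioned in the addendum is configured on the StreamExecutionEnvironment. A minimal sketch, with placeholder interval and timeout values:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTimeoutSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Trigger a checkpoint every 10 minutes (placeholder interval).
            env.enableCheckpointing(10 * 60 * 1000L);

            // Give checkpoints up to 90 minutes before they are declared failed,
            // enough headroom for the 55-minute outlier described above.
            env.getCheckpointConfig().setCheckpointTimeout(90 * 60 * 1000L);
        }
    }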
Hi Julio, Yes, fifty-five minutes really is long. However, it does seem proportional to the time and size of the adjacent task in the diagram. I think the real question is why Flink is accessing HDFS so slowly. You could enable DEBUG logging to see if you can find any clues, or post the log to the mailing list so others can help analyze the problem for you. Thanks, vino. Julio Biason <[hidden email]> wrote on Sat, Sep 15, 2018 at 7:03 AM:
Hi, Julio: Does this happen frequently? Which state backend do you use? The async and sync checkpoint durations seem normal compared to the others; it seems that most of the time is spent acking the checkpoint. On Sun, Sep 16, 2018 at 9:24 AM vino yang <[hidden email]> wrote:
Liu, Renjie Software Engineer, MVAD
This behavior seems very odd, Julio. Could you indeed share the debug logs of all Flink processes so we can see why things are taking so long? The checkpoint size of task #8 is twice as big as the second-biggest checkpoint, but that should not cause an eight-fold increase in checkpoint time. Cheers, Till On Mon, Sep 17, 2018 at 5:25 AM Renjie Liu <[hidden email]> wrote:
I think what's weird is that none of the three stages -- alignment, sync checkpoint, async checkpoint -- takes much time. On Tue, Sep 18, 2018 at 3:20 PM Till Rohrmann <[hidden email]> wrote:
Liu, Renjie Software Engineer, MVAD
Hi,
from your screenshot, it looks like everything is running fine once the snapshots are actually running; the sync and async part times are normal. So I think the explanation is the time the checkpoint barrier needs to reach this particular operator. It seems like there is a large queue of events in the buffers of that operator, in front of the barrier, and/or the operator is very slow at processing events. Given the breakdown at hand, the time must be spent between the triggering of the checkpoint and the point where the barrier reaches the operator that lags behind. Best, Stefan
Adding to my previous email, I am starting to doubt that explanation a little, because the alignment times are also very low. Could it be that it takes very long for the checkpoint operation (for whatever reason) to acquire the checkpointing lock?
Hi,
An alignment time much lower than the whole checkpoint duration can happen when the downstream task only receives its first barrier at, say, the 5-minute mark and then aligns for 2 minutes before completing the checkpoint: the whole duration would be 7 minutes even though the average alignment time is only 2 minutes. I'm just stating a possibility. Thanks, aitozi -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Hey Till (and others), We don't have debug logs yet, but we decided to remove a related component: HDFS. We are moving the storage to our Ceph install (using S3), which has been running for longer than our HDFS install and which we know, for sure, runs without any problems (especially because, at this point, we have more people who understand Ceph than people who know HDFS). If, for some reason, the problem persists, we'll know it's not the underlying storage and may be something in our pipeline itself. I'll enable debug logs then. On Tue, Sep 18, 2018 at 4:20 AM, Till Rohrmann <[hidden email]> wrote:
-- Julio Biason, Software Engineer AZION | Deliver. Accelerate. Protect. Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
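For anyone following along, here is a minimal sketch of what pointing checkpoints at an S3-compatible store (such as Ceph via its S3 gateway) looks like with the FsStateBackend; the bucket and path are placeholders, and the endpoint and credentials are assumed to be configured separately in flink-conf.yaml:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class S3CheckpointSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10 * 60 * 1000L); // placeholder interval

            // Hypothetical bucket/path; one of the flink-s3-fs-* filesystems
            // must be available for the s3:// scheme to resolve.
            env.setStateBackend(new FsStateBackend("s3://flink-checkpoints/pipeline"));
        }
    }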
Ok, thanks for the update. On Tue, Sep 18, 2018, 17:34 Julio Biason <[hidden email]> wrote:
Hey guys, So, switching to Ceph/S3 didn't shed any new light on the issue. Although the times are a bit higher overall, just a few slots are taking an order of magnitude longer to save. So I changed the logs to DEBUG. The problem is: I'm not seeing anything that seems relevant; only pings from ZooKeeper, heartbeats, and S3 disconnecting after being idle. Is there anything else I should change to DEBUG? Akka? Kafka? Hadoop? ZooKeeper? (Those are, by the default config, bumped to INFO.) All of them? On Tue, Sep 18, 2018 at 12:34 PM, Julio Biason <[hidden email]> wrote:
-- Julio Biason, Software Engineer AZION | Deliver. Accelerate. Protect. Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
Hi,
if some tasks take around 50 minutes, could you wait until such a checkpoint is in progress and (let's say after 10 minutes) log into the node and create one (or multiple, over time) thread dumps of the JVM that runs the slow checkpointing task? This could help to figure out where it is stuck or waiting. Best, Stefan
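The usual route for such a dump is jstack against the TaskManager PID; if running a small helper inside the JVM is easier, roughly the same information can be printed programmatically. A sketch (plain JDK, no Flink API involved):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumpSketch {
        public static void main(String[] args) {
            // Dumps all live threads of the current JVM, including lock
            // information, similar to what `jstack <pid>` prints.
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }
    }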
|
A debug log for the state backend and the checkpoint coordinator could also help.
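In case it helps narrow things down without drowning in Akka/Kafka/ZooKeeper output, the relevant loggers can be raised selectively. A sketch for the log4j.properties shipped with Flink (the appender name is assumed to be the default "file"):

    # Keep the global level at INFO.
    log4j.rootLogger=INFO, file

    # Checkpoint coordination, state backend internals, and the RocksDB backend.
    log4j.logger.org.apache.flink.runtime.checkpoint=DEBUG
    log4j.logger.org.apache.flink.runtime.state=DEBUG
    log4j.logger.org.apache.flink.contrib.streaming.state=DEBUG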
Hey guys, Stefan, Yeah, sorry about the stacks. Completely forgot about them. But I think we figured out why it's taking so long (and yeah, Stefan was right from the start): this specific slot is receiving 5x more records than any other slot (on a recent run, it had 10x more records than the second-largest slot) -- and I feel like a complete twit for not realizing this earlier. We'll try to find out why that slot is getting so many records. For now, we switched from the FsStateBackend to the RocksDBStateBackend with incremental checkpointing, and things seem to have somewhat balanced out, with no more failures due to timeouts. If this kind of thing happens again, I'll send the logs -- with stacks this time. :p On Fri, Sep 21, 2018 at 5:19 AM, Stefan Richter <[hidden email]> wrote:
-- Julio Biason, Software Engineer AZION | Deliver. Accelerate. Protect. Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554
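For the archives, the switch described above corresponds roughly to this sketch; the checkpoint URI is a placeholder and flink-statebackend-rocksdb needs to be on the classpath:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RocksDBIncrementalSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(10 * 60 * 1000L); // placeholder interval

            // The second argument enables incremental checkpoints, so only the
            // RocksDB delta since the last checkpoint is uploaded each time.
            env.setStateBackend(new RocksDBStateBackend("s3://flink-checkpoints/pipeline", true));
        }
    }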