(DEPRECATED) Apache Flink User Mailing List archive.

InpoolUsage & InpoolBuffers inconsistence

aitozi

InpoolUsage & InpoolBuffers inconsistence

Hi,

I found that these two metrics are inconsistent: the inpoolQueueLength is
positive, but the inpoolUsage is always zero. Is this a bug? cc @Chesnay
Schepler

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1056/WX20180918-124014%402x.png>

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

aitozi

Re: InpoolUsage & InpoolBuffers inconsistence

My doubt comes from debugging a checkpoint expiration problem.
I ran into checkpoint expiration with no backpressure shown in the web UI.
After adding a lot of logging, I found that the barrier was sent to the
downstream, but the downstream may have been set to autoread = false, which
blocked consumption of the barrier. Yet this temporary state in the input
channel does not cause backpressure on the upstream.

I think this situation can be monitored by checking the inpoolUsage metric:
when it is 1, there may be a problem. But when I checked the inpoolUsage and
inpoolQueueLength, I found the inconsistency. The inpoolUsage is calculated
as bestEffortGetUsedBuffer / allbuffers; could that lead to the mistake?


Zhijiang(wangzhijiang999)

Re: InpoolUsage & InpoolBuffers inconsistence

Hi,

The inpoolQueueLength indicates how many buffers have been received and queued. But if the queued buffers are events (such as barriers), they are not counted in the inpoolUsage.
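
To illustrate the distinction, here is a toy model. It is not Flink's actual code: the class and method names are hypothetical, and it only assumes what is stated above, namely that queued events add to the queue length without counting toward pool usage.

```python
from collections import deque


class ToyInputGate:
    """Toy model (not Flink code): a receive queue plus a fixed-size buffer pool."""

    def __init__(self, pool_size):
        self.pool_size = pool_size   # total buffers in the local pool
        self.used_buffers = 0        # pool buffers currently holding data
        self.queue = deque()         # everything received and not yet consumed

    def receive_data(self):
        # A data record needs a buffer from the pool.
        if self.used_buffers == self.pool_size:
            return False             # pool exhausted, nothing more can be received
        self.used_buffers += 1
        self.queue.append("data")
        return True

    def receive_event(self):
        # An event such as a barrier is queued without taking a pool buffer.
        self.queue.append("event")

    def in_pool_queue_length(self):
        return len(self.queue)

    def in_pool_usage(self):
        return self.used_buffers / self.pool_size


gate = ToyInputGate(pool_size=4)
gate.receive_event()
gate.receive_event()
print(gate.in_pool_queue_length(), gate.in_pool_usage())  # 2 0.0
```

With only events queued, the queue length is positive while the usage stays at zero, which matches the observation in the first mail.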

So in your case these two metrics may be behaving normally. If you observed autoread=false on the downstream side, that means the inpoolUsage may already have reached 100% and there are no available buffers to receive data from the upstream side. But if the upstream still has available buffers to write its output into, the upstream will not be blocked in this case.

BTW, if autoread=false and the inpoolUsage reaches 100%, there may be a lot of buffers queued in front of the barrier, so the checkpoint may expire as you said.
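
The expiry risk can be put in rough numbers. This is a back-of-the-envelope sketch only: the function names and all figures are hypothetical, and it assumes the queued buffers are consumed strictly in order at a constant rate.

```python
def barrier_wait_ms(buffers_ahead, ms_per_buffer):
    """Time until the barrier is reached if queued buffers are consumed in order."""
    return buffers_ahead * ms_per_buffer


def checkpoint_expires(buffers_ahead, ms_per_buffer, timeout_ms):
    """True if the barrier would be consumed only after the checkpoint timeout."""
    return barrier_wait_ms(buffers_ahead, ms_per_buffer) > timeout_ms


# Hypothetical numbers: 1500 buffers ahead of the barrier, 10 ms to process each.
print(barrier_wait_ms(1500, 10))            # 15000
print(checkpoint_expires(1500, 10, 10000))  # True
```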

Best,

Zhijiang

------------------------------------------------------------------
From: aitozi <[hidden email]>
Sent: Tuesday, September 18, 2018 12:59
To: user <[hidden email]>
Subject: Re: InpoolUsage & InpoolBuffers inconsistence


aitozi

Re: Re: InpoolUsage & InpoolBuffers inconsistence

Hi Zhijiang,
Thanks for your reply, but I still have a small question.

Let me describe my debugging process more clearly. I added logging in
PartitionRequestQueue to confirm that writing a barrier and the writeAndFlush
callback happen at almost the same time. (I did this by checking whether the
buffer to be sent is a barrier and calling addListener to log.)
But my log in PartitionRequestClientHandler shows that the downstream got
the "event" 15 seconds later than the send action above. It also ran into
the waitForBuffer branch and waited about 15 seconds until the onEvent
callback from the EventListener (I tested with the 1.3.2 branch). So I think
the cause is that the downstream input channel does not have enough buffers,
which blocks consumption of the barrier.

But at the same time, I can't see the inPoolUsage metric of the subtask
whose checkpoint expired reach 1, which I find very strange. If the input
channel buffers were insufficient, we should at least see this value reach
1. The main reason I care about this is that I want to find a metric to
monitor it. You just mentioned the autoread flag. Do you think monitoring
this flag in the input channel is a good choice?

Thanks,
aitozi.


Zhijiang(wangzhijiang999)

Re: Re: InpoolUsage & InpoolBuffers inconsistence

1. I am not sure whether you are tracking the same barrier on the upstream sender and the downstream receiver. Normally there should not be a 15-second delay between PartitionRequestQueue#write&flush and PartitionRequestClientHandler#channelRead. Also, the listener waiting 15 seconds to be notified of an available buffer suggests that your UDF may be doing some time-consuming operations.

2. I don't think it is necessary to monitor the autoread flag; I explained the autoread process only because you mentioned it in your last email. The inPoolUsage metric is enough for the same purpose.

3. I suggest upgrading to version 1.5 or above. The checkpoint process may be faster there than in your current version.
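
On point 2, a monitoring check over inPoolUsage could be sketched as below. This is an illustration only: the function name, the threshold, and the sample values are hypothetical, and how the samples are collected from your metrics reporter is left out.

```python
def sustained_full(samples, threshold=1.0, min_consecutive=3):
    """Return True if at least min_consecutive samples in a row are >= threshold."""
    run = 0
    for value in samples:
        # Extend the current run on a high sample, otherwise reset it.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical inPoolUsage samples for one subtask, taken at a fixed interval.
print(sustained_full([0.2, 1.0, 1.0, 1.0, 0.5]))  # True
print(sustained_full([1.0, 0.9, 1.0, 1.0]))       # False
```

Requiring several consecutive full samples avoids alerting on a single short spike.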

Best,

Zhijiang

------------------------------------------------------------------
From: aitozi <[hidden email]>
Sent: Tuesday, September 18, 2018 14:53
To: user <[hidden email]>
Subject: Re: Re: InpoolUsage & InpoolBuffers inconsistence
