Hi,
I found that these two metrics are inconsistent: inpoolQueueLength is positive, but inpoolUsage is always zero. Is this a bug? cc @Chesnay Schepler
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1056/WX20180918-124014%402x.png>
-- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
|
My doubt comes from debugging a checkpoint expiration problem.

I encountered checkpoint expiration with no backpressure shown in the web UI. After adding a lot of logging, I found that the barrier had been sent to the downstream, but the downstream may have been set to autoread = false, which blocked consumption of the barrier. This temporary state in the inputchannel, however, does not cause backpressure on the upstream.

I think this situation can be monitored by checking the inpoolUsage metric: when it is 1, there may be a problem. But when I checked inpoolUsage and inpoolQueueLength, I found the inconsistency. inpoolUsage is calculated as bestEffotGetUsedBuffer / allbuffers; could this lead to the mistake?
|
Hi,

The inpoolQueueLength indicates how many buffers have been received and queued. But if the buffers in the queue are events (like barriers), they are not counted in inpoolUsage. So in your case these two metrics may well be normal.

If you observed autoread=false on the downstream side, that means inpoolUsage may already have reached 100% and there are no available buffers to receive data from the upstream side. But if the upstream still has available buffers to output to, the upstream will not be blocked in this case.

BTW, if autoread=false and inpoolUsage reaches 100%, there may be a lot of buffers queued in front of the barrier, so the checkpoint may expire as you said.

Best,
Zhijiang
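To make the difference between the two metrics concrete, here is a toy model in Python. It is only a sketch of the explanation above and not Flink's actual code: the class ToyInputGate and all of its methods are hypothetical names, and it simply assumes that queued events (such as barriers) do not occupy a buffer from the local pool while data buffers do.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyInputGate:
    """Toy receiver model (hypothetical, NOT Flink's implementation)."""
    pool_size: int                                # total buffers in the local pool
    queue: deque = field(default_factory=deque)   # everything received and not yet consumed

    def used_buffers(self) -> int:
        # Only data buffers are assumed to come from the local pool.
        return sum(1 for item in self.queue if item == "data")

    def receive_data(self) -> bool:
        """Queue a data buffer; return False if the pool is exhausted."""
        if self.used_buffers() >= self.pool_size:
            return False  # corresponds to the receiver no longer reading (autoread=false)
        self.queue.append("data")
        return True

    def receive_event(self) -> None:
        """Queue an event such as a barrier; assumed not to use a pool buffer."""
        self.queue.append("event")

    def queue_length(self) -> int:
        # Counts every queued item, events included.
        return len(self.queue)

    def pool_usage(self) -> float:
        # Counts only pool buffers in use, relative to the pool size.
        return self.used_buffers() / self.pool_size


gate = ToyInputGate(pool_size=4)
gate.receive_event()
gate.receive_event()
print(gate.queue_length())  # 2
print(gate.pool_usage())    # 0.0
```

In this model a queue that holds only events gives a positive queue length together with a pool usage of zero, which matches the observation in the first mail. Once four data buffers are queued, pool_usage() becomes 1.0 and receive_data() starts returning False.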
|
Hi, Zhijiang

Thanks for your reply, but I still have a small question. Let me describe my debugging process more clearly.

I added logging in PartitionRequestQueue to confirm that writing a barrier and the callback of writeAndFlush happen at almost the same time. (This is done by checking whether the buffer to be sent is the barrier and using addListener to log.) But my logging in PartitionRequestClientHandler shows that the downstream got the "event" 15 seconds after the send action above. It also ran into the waitForBuffer branch and waited about 15 seconds until the onEvent callback from EventListener (I tested the code with the 1.3.2 branch).

So I think it is caused by the downstream inputchannel not having enough buffers, which blocks consumption of the barrier. But at the same time, I can't see the inPoolUsage metric of the subtask whose checkpoint expired reach 1. I think that is very strange: if the inputchannel buffers are not enough, we should at least see this value reach 1.

The main reason I care about this is that I want to find a metric to monitor it. You mentioned the autoread flag. Do you think monitoring this flag in the inputchannel is a good choice?

Thanks,
aitozi.

Zhijiang(wangzhijiang999) wrote
> [...]
> ------------------------------------------------------------------
> From: aitozi <gjying1314@>
> Sent: Tuesday, 18 September 2018, 12:59
> To: user <user@.apache>
> Subject: Re: InpoolUsage & InpoolBuffers inconsistence
> [...]
|
1. I am not sure whether you are monitoring the same barrier on the upstream sender and the downstream receiver. The time between PartitionRequestQueue#write&flush and PartitionRequestClientHandler#channelRead should not normally be as long as 15 seconds. Also, if the listener waits 15 seconds to be notified of an available buffer, that means your UDF may be involved in some time-consuming operations.
2. I do not think it is necessary to monitor the autoread flag; I explained the autoread process only because you mentioned it in your last email. The inPoolUsage metric is enough for the same purpose.
3. I suggest upgrading to version 1.5 or above. The checkpoint process may be faster than in the current version.

Best,
Zhijiang
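As a rough illustration of point 2, the check below flags a subtask whose inPoolUsage stays at 1.0 for a sustained period. It is a sketch only: the function name saturated_for, the (timestamp, usage) sample format, and the thresholds are assumptions made for this example, and how the samples are collected (for example from a metrics reporter) is left out.

```python
from typing import Iterable, Tuple


def saturated_for(samples: Iterable[Tuple[float, float]],
                  min_seconds: float,
                  threshold: float = 1.0) -> bool:
    """Return True if usage stayed >= threshold for at least min_seconds.

    samples: (timestamp_in_seconds, usage) pairs in ascending time order.
    This helper is hypothetical and not part of Flink.
    """
    run_start = None  # timestamp at which the current saturated run began
    for timestamp, usage in samples:
        if usage >= threshold:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= min_seconds:
                return True
        else:
            run_start = None  # the run was interrupted, start over
    return False


# Hypothetical samples of inPoolUsage for one subtask, taken every 5 seconds.
history = [(0, 0.5), (5, 1.0), (10, 1.0), (15, 1.0), (20, 0.2)]
print(saturated_for(history, min_seconds=10))  # True
print(saturated_for(history, min_seconds=15))  # False
```

Requiring the value to stay high over a time window, rather than alerting on a single sample, avoids reacting to a short burst that fills the pool only briefly.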
|