Re: An addition to Netty's memory footprint


Zhijiang(wangzhijiang999)
Based on Kurt's scenario, if the cumulator allocates a big ByteBuf from the ByteBufAllocator during expansion, it can easily trigger the creation of a new PoolChunk (16M), because none of the current PoolChunks has enough contiguous free memory. This pushes the total direct memory usage beyond the estimate.

Further explanation:
1. Each PoolArena maintains a set of PoolChunks, grouped into different lists by memory usage.
2. Each PoolChunk is divided into pages (8K each), which are organized as a complete balanced binary tree to make allocation easy.
3. When a buffer of a given length is requested from the ByteBufAllocator, the PoolArena searches its current PoolChunks for enough contiguous memory. If none is found, it creates a new chunk.

For example, if a chunk's memory usage is 50%, 8M of the chunk is still available. If the requested length is small, this chunk can satisfy the allocation in most cases.
But if the requested length is large, such as 1M, the remaining 50% may not be able to satisfy it, because the free pages may not all sit under the same parent node in the tree.
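
To make the fragmentation case concrete, here is a minimal sketch of a buddy-style chunk in plain Java. It is not Netty's PoolChunk code: the class BuddyChunkSketch and all of its methods are hypothetical names invented for this illustration, and the only figures taken from the discussion are the 16M chunk and the 8K pages. It builds a chunk that is exactly 50% used, with every second page free, and then tries a 1M allocation.

```java
/** Simplified buddy-style chunk: a complete binary tree over 2048 pages of 8 KiB (16 MiB total). */
final class BuddyChunkSketch {
    static final int PAGE_SIZE = 8 * 1024;
    static final int MAX_ORDER = 11;                      // 2^11 = 2048 leaf pages
    static final int CHUNK_SIZE = PAGE_SIZE << MAX_ORDER; // 16 MiB

    // maxFree[id] = largest contiguous free block (in bytes) inside the subtree of node id.
    // Node 1 is the root; the children of node id are 2*id and 2*id + 1.
    private final int[] maxFree = new int[1 << (MAX_ORDER + 1)];
    private int usedBytes;

    BuddyChunkSketch() {
        for (int id = 1; id < maxFree.length; id++) {
            maxFree[id] = nodeSize(id);
        }
    }

    private static int nodeSize(int id) {
        int depth = 31 - Integer.numberOfLeadingZeros(id);
        return CHUNK_SIZE >>> depth;
    }

    // Rounds a request up to the next power-of-two multiple of the page size.
    private static int roundUp(int size) {
        if (size <= 0 || size > CHUNK_SIZE) {
            throw new IllegalArgumentException("size: " + size);
        }
        int need = PAGE_SIZE;
        while (need < size) {
            need <<= 1;
        }
        return need;
    }

    /** Returns the byte offset of the allocated block, or -1 if no contiguous block fits. */
    int allocate(int size) {
        int need = roundUp(size);
        if (maxFree[1] < need) {
            return -1; // free bytes may exist, but not as one block under a single node
        }
        int id = 1;
        while (nodeSize(id) > need) { // walk down towards a fully free node of the right size
            int left = id << 1;
            id = maxFree[left] >= need ? left : left + 1;
        }
        maxFree[id] = 0;
        updateParents(id);
        usedBytes += need;
        return (id - CHUNK_SIZE / need) * need;
    }

    /** Frees a block previously returned by allocate(size). */
    void free(int offset, int size) {
        int need = roundUp(size);
        int id = CHUNK_SIZE / need + offset / need;
        maxFree[id] = need;
        updateParents(id);
        usedBytes -= need;
    }

    private void updateParents(int id) {
        for (int p = id >>> 1; p >= 1; p >>>= 1) {
            int left = maxFree[p << 1];
            int right = maxFree[(p << 1) + 1];
            int half = nodeSize(p) >>> 1;
            // Two fully free buddies merge back into one block of the parent's size.
            maxFree[p] = (left == half && right == half) ? nodeSize(p) : Math.max(left, right);
        }
    }

    int usedBytes() {
        return usedBytes;
    }

    /** Builds a chunk that is 50% used, with every second 8 KiB page free. */
    static BuddyChunkSketch halfUsedFragmentedChunk() {
        BuddyChunkSketch chunk = new BuddyChunkSketch();
        int[] offsets = new int[1 << MAX_ORDER];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = chunk.allocate(PAGE_SIZE);
        }
        for (int i = 0; i < offsets.length; i += 2) {
            chunk.free(offsets[i], PAGE_SIZE);
        }
        return chunk;
    }

    public static void main(String[] args) {
        BuddyChunkSketch chunk = halfUsedFragmentedChunk();
        System.out.println("used bytes:      " + chunk.usedBytes() + " of " + CHUNK_SIZE);
        System.out.println("allocate(1 MiB): " + chunk.allocate(1024 * 1024));
        System.out.println("allocate(8 KiB): " + chunk.allocate(PAGE_SIZE));
    }
}
```

In this sketch the 1M request returns -1 even though 8M is free, because no single node of size 1M is entirely free, while the 8K request still succeeds. In the scenario described above, that failed lookup is the point at which a new 16M chunk would be created.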

After the network improvements described in Stephan's FLIP, the direct memory used by Netty's pooled ByteBufs can be greatly reduced and controlled more easily.

cheers,
zhijiang

------------------------------------------------------------------
From: Kurt Young <[hidden email]>
Sent: Friday, June 30, 2017, 15:51
To: dev <[hidden email]>; user <[hidden email]>
Subject: An addition to Netty's memory footprint

Hi,

Ufuk wrote up an excellent document about Netty's memory allocation inside Flink [1], and I want to add one more note after running some large-scale jobs.

The only inaccurate point in [1] is how much memory LengthFieldBasedFrameDecoder uses. From our observations, it can cost up to 4M for each physical connection.

Why (tl;dr): ByteToMessageDecoder, the base class of LengthFieldBasedFrameDecoder, uses a Cumulator to hold bytes for further decoding. The Cumulator tries to discard some already-read bytes to make room in the buffer when channelReadComplete is triggered. In most cases, channelReadComplete is only triggered by AbstractNioByteChannel after it has read "maxMessagesPerRead" times. The default value of maxMessagesPerRead is 16, so in the worst case up to 1M (64K * 16) of data is written into the Cumulator before any discard happens. And due to the logic of ByteBuf's discardSomeReadBytes, the Cumulator will expand to 4M.
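
The arithmetic above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: CumulatorGrowthSketch and its methods are hypothetical names, the 64K per read and the values 16 and 2 come from this mail, and the power-of-two rounding is an assumed, simplified growth policy rather than Netty's actual code. It does not reproduce the exact path to 4M, which depends on how many already-read bytes discardSomeReadBytes leaves in the buffer.

```java
/** Back-of-the-envelope model of how much a cumulation buffer may have to hold. */
final class CumulatorGrowthSketch {

    /** Upper bound on the bytes appended before channelReadComplete fires. */
    static long worstCaseBytesPerReadLoop(int bytesPerRead, int maxMessagesPerRead) {
        return (long) bytesPerRead * maxMessagesPerRead;
    }

    /** Assumed growth policy: the smallest power of two that is >= the required capacity. */
    static long powerOfTwoCapacity(long required) {
        long capacity = 64;
        while (capacity < required) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        int bytesPerRead = 64 * 1024; // 64K per read, as in the estimate above
        for (int maxMessagesPerRead : new int[] {16, 2}) {
            long worst = worstCaseBytesPerReadLoop(bytesPerRead, maxMessagesPerRead);
            // One byte that is still retained in the buffer is enough to
            // push the rounded capacity to the next power of two.
            long withLeftover = powerOfTwoCapacity(worst + 1);
            System.out.println("maxMessagesPerRead=" + maxMessagesPerRead
                    + " worstCaseBytes=" + worst
                    + " capacityWithLeftoverByte=" + withLeftover);
        }
    }
}
```

Under these assumptions, 16 reads give 1M of appended data per read loop, and any bytes still retained in the buffer round the capacity up past that, while 2 reads keep the per-loop data at 128K.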

We added an option to tune maxMessagesPerRead and set it to 2, and everything works fine. I know Stephan is working on network improvements, and replacing the whole Netty pipeline with Flink's own implementation would be a good choice. But I think we will face similar logic in that implementation, so we should be careful about this.

BTW, should we open a JIRA issue to add this config?