Hi, all
In Flink 1.12 we introduced the TM merge shuffle, but its out-of-the-box experience is not very good. The main reason is that users regularly run into OOM errors with the default configuration [1]. We therefore propose a managed memory pool for the TM merge shuffle to avoid this problem.

Goals
Proposal
In this default configuration, allocating the memory pool should almost never fail. The framework's off-heap memory currently defaults to 128m and is mainly used by Netty. Since we introduced zero copy, Netty's usage of it has dropped; see [2] for the detailed data.

Known Limitation

Usability for increasing the memory pool size

In addition to increasing `taskmanager.memory.network.batch-read`, the user may also need to adjust `taskmanager.memory.framework.off-heap.size` at the same time. If the user forgets this, the check performed when allocating the memory pool is likely to fail. So in the following two situations, we will still prompt the user to increase the size of `framework.off-heap.size`.
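The sizing constraint described above can be sketched roughly as follows. This is only an illustration and not Flink code: the class `BatchShufflePoolCheck`, its methods, and the rule that the pool must not exceed the framework off-heap budget are assumptions made for the example, and 16m is used purely as an illustrative pool size. Only the 128m default and the two configuration keys come from this thread.

```java
public final class BatchShufflePoolCheck {
    static final long MB = 1024L * 1024L;

    // Assumption (hypothetical rule): the read buffer pool is taken out of
    // framework off-heap memory, so it must not be larger than that budget.
    static boolean fitsInFrameworkOffHeap(long poolBytes, long frameworkOffHeapBytes) {
        if (poolBytes < 0 || frameworkOffHeapBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        return poolBytes <= frameworkOffHeapBytes;
    }

    // Returns an empty string when the pool fits, otherwise a hint that
    // points the user at the second option they need to raise.
    static String hint(long poolBytes, long frameworkOffHeapBytes) {
        if (fitsInFrameworkOffHeap(poolBytes, frameworkOffHeapBytes)) {
            return "";
        }
        return "Batch read pool (" + poolBytes / MB + "m) exceeds framework off-heap memory ("
                + frameworkOffHeapBytes / MB + "m); increase "
                + "taskmanager.memory.framework.off-heap.size";
    }

    public static void main(String[] args) {
        long frameworkOffHeap = 128 * MB; // default quoted in this thread
        System.out.println(hint(16 * MB, frameworkOffHeap).isEmpty()); // pool fits
        System.out.println(hint(256 * MB, frameworkOffHeap));          // pool too large
    }
}
```

The point of the sketch is the usability problem itself: raising only `taskmanager.memory.network.batch-read` past the off-heap budget would trip such a check unless `taskmanager.memory.framework.off-heap.size` is raised as well.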
An alternative is for the system to adjust `framework.off-heap.size` automatically whenever the user changes the size of the memory pool. But we are not entirely sure about this, because it is implicit and complicates the memory configuration.

Potential memory waste

In the first step, the memory pool will not be released once allocated. This means that even if no batch job follows, the pooled memory cannot be used by other consumers. We are not releasing the pool in the first step out of concern that frequently allocating and deallocating the entire pool may increase GC pressure. Investigating how to dynamically release the pool when it is no longer needed is considered a future follow-up.

Looking forward to your feedback.

[1] https://issues.apache.org/jira/browse/FLINK-20740
[2] https://github.com/apache/flink/pull/7368

Best,
Guowei

Thanks, Guowei, for the proposal.

As discussed offline already, I think this sounds good. One thought is that 16m sounds very small for a default read buffer pool. How risky do you think it is to increase this to 32m or 64m?

Best,
Stephan

On Fri, Mar 5, 2021 at 4:33 AM Guowei Ma <[hidden email]> wrote:

Thanks for this proposal, Guowei. +1 for it.

Concerning the default size, maybe we can run some experiments and see how the system behaves with different pool sizes.

Cheers,
Till

On Fri, Mar 5, 2021 at 2:45 PM Stephan Ewen <[hidden email]> wrote:

Hi, all

I think it is a good idea to settle on a larger default size for the separate pool through testing. I am fine with adding the suffix (".size") to the config name, which makes it clearer to the user. But I am a little worried about adding the prefix ("framework"), because currently the TM shuffle service is only a shuffle plugin, which is not part of the framework. So maybe we could add a clear explanation in the documentation?

Best,
Guowei

On Tue, Mar 9, 2021 at 3:58 PM 曹英杰(北牧) <[hidden email]> wrote:

Thanks for the update, Yingjie. Then let's go with 32 MB, I would say.

Concerning the name of the configuration option, I see Xintong's point. If the batch shuffle is subtracted from `taskmanager.memory.framework.off-heap.size` because it is part of the off-heap pool, then something like `taskmanager.memory.framework.off-heap.batch-shuffle.size` would better reflect this situation. On the other hand, this is quite a long configuration name. But it is also quite an advanced configuration option which, hopefully, should not be touched by too many of our users.

Cheers,
Till

On Mon, Mar 22, 2021 at 9:15 AM 曹英杰(北牧) <[hidden email]> wrote:

Hi Yingjie!

Thanks for doing those experiments, the results look good. Let's go ahead with 32M then.

Regarding the key, I am not strongly opinionated. There are arguments for both keys: (1) making the key part of the network pool config as you did here, or (2) making it part of the TM config (relative to framework off-heap memory). I find (1) quite understandable, but it is a matter of personal taste, so I can go with either option.

Best,
Stephan

On Mon, Mar 22, 2021 at 9:15 AM 曹英杰(北牧) <[hidden email]> wrote:

Hi,

I discussed with Xintong and Yingjie offline, and we agreed that the name `taskmanager.memory.framework.off-heap.batch-shuffle.size` better reflects the current memory usage. So we decided to use the name Till suggested.

Thank you all for your valuable feedback.

Best,
Guowei

On Mon, Mar 22, 2021 at 5:21 PM Stephan Ewen <[hidden email]> wrote: