Hi,

We are sporadically observing an "Insufficient number of network buffers" issue after upgrading Flink from 1.4.2 to 1.8.2. The state of the affected tasks transitions from DEPLOYING to FAILED. Whenever this issue occurs, the job manager restarts; sometimes the issue goes away after the restart. As we cannot reproduce the issue consistently, we are unsure whether to change the memory configuration or not.

Minimum recommended number of network buffers: (8 slots/TM)^2 * 8 TMs * 4 = 2048
The exception says that 13112 network buffers are configured, which is more than 6x the recommendation. Is reducing the number of shuffles the only way to reduce the number of network buffers required?

Thanks,
Rahul

Configs:
env: Kubernetes
Flink: 1.8.2
using default configs for memory.fraction, memory.min, memory.max
8 TMs, 8 slots/TM
each TM runs with 1 core and 4 GB memory

Exception:
java.io.IOException: Insufficient number of network buffers: required 2, but only 0 available. The total number of network buffers is currently set to 13112 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.network.memory.fraction', 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
        at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
        at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
        at org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
        at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
        at java.lang.Thread.run(Thread.java:748)
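For reference, the keys named in the exception, with what I believe are the Flink 1.8 defaults (we have not overridden any of them):

taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.min: 64mb
taskmanager.network.memory.max: 1gb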
Hi Rahul,
Try increasing taskmanager.network.memory.max to 1 GB, basically double what you have now. However, you only have 4 GB of RAM for the entire TM, and a 1 GB network buffer pool seems out of proportion to 4 GB of total RAM. Reducing the number of shuffles will require fewer network buffers, but if your job needs the shuffles, you may consider adding more memory to the TMs.

Thanks,
Ivan
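P.S. If I understand the 1.8 semantics correctly, these three keys combine roughly as (a sketch, not the exact code):

network memory = min(memory.max, max(memory.min, memory.fraction * total TM memory))

so memory.max only becomes the limiting factor once fraction * total exceeds it.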
Thanks for your reply, Ivan. I think taskmanager.network.memory.max is 1 GB by default. In my case, the network buffer memory is 13112 * 32768 ≈ 400 MB, which is 10% of the TM memory, as taskmanager.network.memory.fraction defaults to 0.1. Do you mean that we should increase taskmanager.network.memory.fraction?
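Spelling out the arithmetic (assuming the fraction is applied to the full 4 GB):

0.1 * 4 GB ≈ 410 MB
410 MB / 32768 bytes per buffer ≈ 13100 buffers

which roughly matches the 13112 buffers reported in the exception; the small difference presumably comes from how the total memory is measured.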
I am wondering whether having too few network buffers is the root cause, or whether the root cause is something else that triggers this issue.
Yes, increase taskmanager.network.memory.fraction in your case. Also, reducing the parallelism will reduce the number of network buffers required for your job. I never used 1.4.x, so I don't know about it.
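For example, something like this in flink-conf.yaml (0.2 is just an illustrative starting point, double the default):

taskmanager.network.memory.fraction: 0.2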
Ivan
From the metrics in Prometheus, we observed that the minimum AvailableMemorySegments across all the task managers was 4.5k when the exception was thrown, so there were enough network buffers overall.

A correction to the configs provided above: each TM has 8 CPU cores, not 1.

Apart from having too few network buffers, can something else trigger this issue? Also, is it expected that the issue is sporadic?

Rahul
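P.S. We used a query along these lines (assuming the default Prometheus reporter metric naming; the exact name can differ by setup):

min(flink_taskmanager_Status_Network_AvailableMemorySegments)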
After debugging more, it seems this issue is caused by the scheduling strategy. Depending on which tasks are assigned to a task manager, the memory configured for network buffers on that TM can run out. From these references: FLINK-12122, FLINK-15031, and the Flink 1.10 release notes, we learned that the scheduling strategy changed between 1.4.2 and 1.5.0 (FLIP-6), and that since 1.9.2 the old behavior can be restored via a configuration option for the scheduling strategy: cluster.evenly-spread-out-slots: true. The "spread out" strategy could definitely help in this case.

Can you please confirm our findings and suggest possible ways to mitigate this issue?

Rahul
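P.S. For anyone hitting the same problem: per the issues above, the option is available from 1.9.2 onwards (so not in our 1.8.2, where it would mean upgrading) and goes into flink-conf.yaml as:

cluster.evenly-spread-out-slots: true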