sporadic "Insufficient no of network buffers" issue


sporadic "Insufficient no of network buffers" issue

Rahul Patwari
Hi,

We are observing the "Insufficient number of network buffers" issue sporadically after upgrading Flink from 1.4.2 to 1.8.2.
The state of the tasks hitting this issue transitions from DEPLOYING to FAILED.
Whenever this issue occurs, the JobManager restarts. Sometimes the issue goes away after the restart.
Because we cannot reproduce the issue consistently, we are unsure whether to change the memory configurations.

Min. recommended number of network buffers: slots-per-TM^2 * #TMs * 4 = (8 * 8) * 8 * 4 = 2048
The exception says 13112 network buffers are configured, which is roughly 6x the recommendation.

Is reducing the number of shuffles the only way to reduce the number of network buffers required?
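The arithmetic above can be double-checked with a quick sketch. Note that the slots^2 * TMs * 4 formula is the commonly cited rule of thumb for the minimum, not Flink's exact internal accounting:

```python
# Rule-of-thumb minimum number of network buffers: slots-per-TM^2 * #TMs * 4.
slots_per_tm = 8
num_tms = 8
recommended = slots_per_tm ** 2 * num_tms * 4
print(recommended)  # 2048

# Total network memory implied by the figures in the exception:
total_buffers = 13112
segment_size = 32768  # 32 KB per buffer, the default segment size
total_mb = total_buffers * segment_size / 1024 ** 2
print(total_mb)  # 409.75 (MB)
print(round(total_buffers / recommended, 1))  # 6.4, i.e. ~6x the minimum
```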

Thanks,
Rahul 

configs:
env: Kubernetes 
Flink: 1.8.2
using default values for taskmanager.network.memory.fraction, memory.min, and memory.max
using 8 TMs with 8 slots per TM
Each TM is running with 1 core and 4 GB of memory.

Exception:
java.io.IOException: Insufficient number of network buffers: required 2, but only 0 available. The total number of network buffers is currently set to 13112 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.network.memory.fraction', 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'.
at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:138)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:311)
at org.apache.flink.runtime.io.network.NetworkEnvironment.setupInputGate(NetworkEnvironment.java:271)
at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:224)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:614)
at java.lang.Thread.run(Thread.java:748)

Re: sporadic "Insufficient no of network buffers" issue

Ivan Yang
Hi Rahul,

Try increasing taskmanager.network.memory.max to 1 GB, basically double what you have now. However, since you only have 4 GB of RAM for the entire TM, a 1 GB network buffer pool seems out of proportion to the total. Reducing the number of shuffles will require fewer network buffers, but if your job needs the shuffles, consider adding more memory to the TMs instead.
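As a sketch, the suggestion above would look something like this in flink-conf.yaml. The keys are the ones named in the exception; the values here are purely illustrative and would need to fit within the 4 GB TM budget:

```yaml
# Illustrative flink-conf.yaml fragment (Flink 1.8.x network-memory keys).
# Raising the fraction lets the network stack claim more of the TM memory,
# up to the configured max.
taskmanager.network.memory.fraction: 0.2
taskmanager.network.memory.min: 64mb
taskmanager.network.memory.max: 1gb
```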

Thanks,
Ivan

On Jul 31, 2020, at 2:02 PM, Rahul Patwari <[hidden email]> wrote:



Re: sporadic "Insufficient no of network buffers" issue

Rahul Patwari
Thanks for your reply, Ivan.

I think taskmanager.network.memory.max defaults to 1 GB.
In my case, the network buffer memory is 13112 * 32768 bytes, around 410 MB, which is 10% of the TM memory, since taskmanager.network.memory.fraction defaults to 0.1.
Do you mean we should increase taskmanager.network.memory.fraction?
  1. Does the application need more network buffers after upgrading Flink from 1.4.2 to 1.8.2?
  2. Can this issue happen sporadically? Sometimes the issue does not appear after the JobManager restarts.
I am wondering whether having too few network buffers is the root cause, or whether something else triggers this issue.
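A quick sanity check of the numbers above, using the buffer count from the exception and the 4 GB TM size from the config:

```python
# Values taken from the exception message and the TM configuration.
tm_memory_mb = 4 * 1024                       # 4 GB per TaskManager
network_mb = 13112 * 32768 / 1024 ** 2        # buffers * segment size
print(round(network_mb))                       # ~410 MB of network memory
print(round(network_mb / tm_memory_mb, 2))     # ~0.1, matching the default fraction
```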

On Sat, Aug 1, 2020 at 9:36 AM Ivan Yang <[hidden email]> wrote:


Re: sporadic "Insufficient no of network buffers" issue

Ivan Yang
Yes, increase taskmanager.network.memory.fraction in your case. Reducing the parallelism will also reduce the number of network buffers your job requires. I have never used 1.4.x, so I cannot comment on it.

Ivan

On Jul 31, 2020, at 11:37 PM, Rahul Patwari <[hidden email]> wrote:




Re: sporadic "Insufficient no of network buffers" issue

Rahul Patwari
From the metrics in Prometheus, we observed that the minimum AvailableMemorySegments across all TaskManagers was 4.5k when the exception was thrown.
So there were enough network buffers.
A correction to the configs provided above: each TM has 8 CPU cores.

Apart from having too few network buffers, can something else trigger this issue?
Also, is it expected that the issue occurs only sporadically?

Rahul

On Sat, Aug 1, 2020 at 12:24 PM Ivan Yang <[hidden email]> wrote:



Re: sporadic "Insufficient no of network buffers" issue

Rahul Patwari
After debugging further, it seems this issue is caused by the scheduling strategy.
Depending on which tasks are assigned to a TaskManager, the memory configured for network buffers on that TM can probably run out.

Through these references: FLINK-12122, FLINK-15031, and the Flink 1.10 release notes, we learned that the scheduling strategy changed in 1.5.0 (FLIP-6) compared to 1.4.2, and that since 1.9.2 the old behavior can be restored via the configuration option cluster.evenly-spread-out-slots: true.

The "spread out" strategy could definitely help in this case.
Can you please confirm our findings and suggest possible ways to mitigate this issue?
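For reference, the option mentioned above is a single flink-conf.yaml entry (available from 1.9.2 onward):

```yaml
# Spread slots across all TaskManagers instead of filling one TM
# before moving on to the next.
cluster.evenly-spread-out-slots: true
```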

Rahul   

On Sat, Aug 1, 2020 at 9:24 PM Rahul Patwari <[hidden email]> wrote: