Hi there,
I have a cluster of 10 nodes with 12 CPUs each. This is my configuration: jobmanager.rpc.port: 6123 jobmanager.heap.mb: 4024 taskmanager.heap.mb: 8096 taskmanager.numberOfTaskSlots: 12 parallelization.degree.default: 120 I have been getting the following error: java.lang.Exception: Failed to deploy the task Reduce (SUM(1)) (65/120) - execution #0 to slot SimpleSlot (1)(0) - efc370a0b2a9a63f2e7b960cfe4e4c27 - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network buffers: required 120, but only 2 of 2048 available. at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:155) at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:163) at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$submitTask(TaskManager.scala:426) at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:261) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:89) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) at org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:344) at akka.dispatch.OnComplete.internal(Future.scala:247) at akka.dispatch.OnComplete.internal(Future.scala:244) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) I failed to get any info online on how to solve it. Any help would be welcome. Thank you! |
Hi Yiannis! You need to increase the number of buffers for your setup. Here is a FAQ entry with a few pointers: Greetings, Am 18.02.2015 21:21 schrieb "Yiannis Gkoufas" <[hidden email]>:
|
In reply to this post by Yiannis Gkoufas
Hi Yiannis, if you scale Flink to larger setups you need to adapt the number of network buffers. The background section of the configuration reference explains the details on that [1]. Let us know, if that helped to solve the problem. Best, Fabian 2015-02-18 21:18 GMT+01:00 Yiannis Gkoufas <[hidden email]>:
|
Hi! thank you for your replies! I increased the number of network buffers: taskmanager.network.numberOfBuffers: 2048 but I am still getting the same error: Insufficient number of network buffers: required 120, but only 2 of 2048 available. Thanks a lot! On 18 February 2015 at 20:27, Fabian Hueske <[hidden email]> wrote:
|
2048 is the default. So you didn't actually increase the number of buffers ;-) Try 4096 or so. 2015-02-18 22:59 GMT+01:00 Yiannis Gkoufas <[hidden email]>:
|
Perfect! It worked! Thanks a lot for the help! On 18 February 2015 at 22:13, Fabian Hueske <[hidden email]> wrote:
|
Would it be helpful to add additional message in the error message in
NetworkBufferPool#createBufferPool to check the taskmanager.network.numberOfBuffers property? - Henry On Wed, Feb 18, 2015 at 4:32 PM, Yiannis Gkoufas <[hidden email]> wrote: > Perfect! It worked! Thanks a lot for the help! > > On 18 February 2015 at 22:13, Fabian Hueske <[hidden email]> wrote: >> >> 2048 is the default. So you didn't actually increase the number of buffers >> ;-) >> >> Try 4096 or so. >> >> 2015-02-18 22:59 GMT+01:00 Yiannis Gkoufas <[hidden email]>: >>> >>> Hi! >>> >>> thank you for your replies! >>> I increased the number of network buffers: >>> >>> taskmanager.network.numberOfBuffers: 2048 >>> >>> but I am still getting the same error: >>> >>> Insufficient number of network buffers: required 120, but only 2 of 2048 >>> available. >>> >>> Thanks a lot! >>> >>> >>> On 18 February 2015 at 20:27, Fabian Hueske <[hidden email]> wrote: >>>> >>>> Hi Yiannis, >>>> >>>> if you scale Flink to larger setups you need to adapt the number of >>>> network buffers. >>>> The background section of the configuration reference explains the >>>> details on that [1]. >>>> >>>> Let us know, if that helped to solve the problem. >>>> >>>> Best, Fabian >>>> >>>> [1] http://flink.apache.org/docs/0.8/config.html#background >>>> >>>> 2015-02-18 21:18 GMT+01:00 Yiannis Gkoufas <[hidden email]>: >>>>> >>>>> Hi there, >>>>> >>>>> I have a cluster of 10 nodes with 12 CPUs each. >>>>> This is my configuration: >>>>> >>>>> jobmanager.rpc.port: 6123 >>>>> >>>>> jobmanager.heap.mb: 4024 >>>>> >>>>> taskmanager.heap.mb: 8096 >>>>> >>>>> taskmanager.numberOfTaskSlots: 12 >>>>> >>>>> parallelization.degree.default: 120 >>>>> >>>>> I have been getting the following error: >>>>> >>>>> java.lang.Exception: Failed to deploy the task Reduce (SUM(1)) (65/120) >>>>> - execution #0 to slot SimpleSlot (1)(0) - efc370a0b2a9a63f2e7b960cfe4e4c27 >>>>> - ALLOCATED/ALIVE: java.io.IOException: Insufficient number of network >>>>> buffers: required 120, but only 2 of 2048 available. >>>>> at >>>>> org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:155) >>>>> at >>>>> org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:163) >>>>> at >>>>> org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$submitTask(TaskManager.scala:426) >>>>> at >>>>> org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:261) >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>>>> at >>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>>>> at >>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) >>>>> at >>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) >>>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>>>> at >>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) >>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>>>> at >>>>> org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:89) >>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>> >>>>> at >>>>> org.apache.flink.runtime.executiongraph.Execution$2.onComplete(Execution.java:344) >>>>> at akka.dispatch.OnComplete.internal(Future.scala:247) >>>>> at akka.dispatch.OnComplete.internal(Future.scala:244) >>>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) >>>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) >>>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) >>>>> at >>>>> scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) >>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>> at >>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>> >>>>> >>>>> I failed to get any info online on how to solve it. >>>>> Any help would be welcome. >>>>> >>>>> Thank you! >>>> >>>> >>> >> > |
I agree with Henry. We should include the name of the required configuration parameter into the exception. Users often run into this issue. I've filed a JIRA to track the fix: https://issues.apache.org/jira/browse/FLINK-1646 On Thu, Feb 19, 2015 at 6:18 PM, Henry Saputra <[hidden email]> wrote: Would it be helpful to add additional message in the error message in |
Good idea. I've changed the message. :)
On 04 Mar 2015, at 14:51, Robert Metzger <[hidden email]> wrote: > I agree with Henry. > We should include the name of the required configuration parameter into the exception. > Users often run into this issue. |
Free forum by Nabble | Edit this page |