Hello,

I was running a program with 'parallelism.default' set to 384, as I read in the documentation on Flink's official page that 'parallelism.default' is "the total number of CPUs in the cluster". I have four machines with 96 cores on each of them, so 96*4=384. But the program threw an error saying:

Caused by: java.io.IOException: Insufficient number of network buffers: required 384, but only 298 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'.

What does this mean? And how do I choose a proper value for parallelism?

-- Thank You Regards Punit Naik |
Hi, I think you've chosen a good initial value for the parallelism. The higher the parallelism, the more network buffers are needed. I would follow the recommendation from the exception and increase the number of network buffers. On Thu, May 5, 2016 at 11:23 AM, Punit Naik <[hidden email]> wrote:
|
Yes, I followed it and changed it to 298, but it said the same thing again. The only change was that it now said "required 298, but only 200 available". Why did it say that? On Thu, May 5, 2016 at 4:50 PM, Robert Metzger <[hidden email]> wrote:
-- Thank You Regards |
The TMs request the buffers in batches, so 298 were requested, but only 200 were left in the pool. This means your overall pool size is too small. Here is the relevant section from the documentation: On Thu, May 5, 2016 at 1:30 PM, Punit Naik <[hidden email]> wrote:
My GPG Key ID: 336E2680 |
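The "required N, but only M available" wording can be illustrated with a toy pool. This is only a sketch of the idea described above, not Flink's actual implementation: the `BufferPool` class, its all-or-nothing `request` method, and the equal batch size of 298 are assumptions made for illustration. Real batch sizes depend on the job and differ between tasks.

```python
class BufferPool:
    """Toy model of a fixed-size network buffer pool (hypothetical, not Flink code)."""

    def __init__(self, total):
        self.total = total
        self.available = total

    def request(self, count):
        # In this toy model a batch is granted completely or not at all.
        if count > self.available:
            raise IOError(
                f"Insufficient number of network buffers: required {count}, "
                f"but only {self.available} available. The total number of "
                f"network buffers is currently set to {self.total}."
            )
        self.available -= count


pool = BufferPool(total=2048)
granted = 0
message = ""
try:
    # Keep requesting equal batches until the pool can no longer serve one.
    while True:
        pool.request(298)
        granted += 1
except IOError as err:
    message = str(err)

print(granted, pool.available)
print(message)
```

The point of the sketch is that the failing request is for a single batch, while the shortage is a property of the whole pool: earlier batches have already drained it, so lowering the parallelism slightly only changes the numbers in the message.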
In reply to this post by Punit Naik
The default value of taskmanager.network.numberOfBuffers is 2048. I would recommend using a multiple of that value, for example 16384 (given that you have enough memory per TaskManager). I recommend checking out these slides I created a while ago. They explain what the network buffers are needed for: http://www.slideshare.net/robertmetzger1/apache-flink-hands-on#37 On Thu, May 5, 2016 at 1:30 PM, Punit Naik <[hidden email]> wrote:
|
Could it be that the TaskManagers are configured with not enough memory? On Thu, 5 May 2016 at 13:35 Robert Metzger <[hidden email]> wrote:
|
I am afraid not. On 07-May-2016 1:24 PM, "Aljoscha Krettek" <[hidden email]> wrote:
|
Hey Punit,
you need to give the task managers more network buffers, as Robert suggested. Using the formula from the docs, can you please use 147456 (96^2*4*4) for the number of network buffers? Each buffer is 32 KB, meaning that you give 4.5 GB of memory to the network stack. You might have to adjust the heap memory (taskmanager.heap.mb) you give to the task managers accordingly.

Does this solve it?

– Ufuk

On Sat, May 7, 2016 at 10:50 AM, Punit Naik <[hidden email]> wrote:
>>>>> On Thu, May 5, 2016 at 11:23 AM, Punit Naik <[hidden email]> wrote:
>>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I was running a program with 'parallelism.default' of 384 as I read in
>>>>>> the documentation on Flink's official page that 'parallelism.default' is
>>>>>> "the total number of CPUs in the cluster". I have four machines with 96
>>>>>> cores on each of them. So 96*4=384. But the program threw an error saying:
>>>>>>
>>>>>> Caused by: java.io.IOException: Insufficient number of network
>>>>>> buffers: required 384, but only 298 available. The total number of network
>>>>>> buffers is currently set to 2048. You can increase this number by setting
>>>>>> the configuration key 'taskmanager.network.numberOfBuffers'.
>>>>>>
>>>>>> What does this mean? And how to choose a proper value for parallelism?
>>>>>>
>>>>>> --
>>>>>> Thank You
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Punit Naik
|
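The arithmetic in Ufuk's reply can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the formula (#cores^2 * #machines * 4) and the 32 KB buffer size are taken from the thread rather than from any Flink API.

```python
BUFFER_SIZE_KB = 32  # per-buffer size as stated in the thread


def network_buffers(cores_per_machine, machines, factor=4):
    """Number of buffers from the formula #cores^2 * #machines * 4."""
    return cores_per_machine ** 2 * machines * factor


def buffer_memory_gb(num_buffers, buffer_size_kb=BUFFER_SIZE_KB):
    """Memory taken by the network stack, in GB (1 GB = 1024 * 1024 KB)."""
    return num_buffers * buffer_size_kb / (1024 * 1024)


buffers = network_buffers(cores_per_machine=96, machines=4)
print(buffers)                    # 147456
print(buffer_memory_gb(buffers))  # 4.5
```

Because the buffer count grows with the square of the cores per machine, the memory cost rises quickly on large machines, which is why the heap setting may need adjusting as well.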
Hi Ufuk Thanks for the detailed answer. I will definitely try this and get back to you. On 09-May-2016 2:08 PM, "Ufuk Celebi" <[hidden email]> wrote:
Hey Punit, |
On Mon, May 9, 2016 at 11:05 AM, Punit Naik <[hidden email]> wrote:
> Thanks for the detailed answer. I will definitely try this and get back to
> you.

OK, looking forward to it. ;)

In the meantime I've updated the docs with a more concise version of what to do when you see this exception. |
Yeah, thanks a lot for that. Also, if you could, please write the formula, #cores^2 * #machines * 4, in a different form so that it's more readable and understandable. On 09-May-2016 2:54 PM, "Ufuk Celebi" <[hidden email]> wrote:
On Mon, May 9, 2016 at 11:05 AM, Punit Naik <[hidden email]> wrote: |
Yes, I did just that and I used the relevant Flink terminology instead of #cores and #machines:

#cores => #slots per TM
#machines => #TMs

On Mon, May 9, 2016 at 11:33 AM, Punit Naik <[hidden email]> wrote: |
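With the renamed terms, the formula reads #slots per TM ^2 * #TMs * 4. The sketch below turns that into the two configuration entries named in this thread. The helper name `flink_network_config` is hypothetical, and it assumes one slot per core and one TaskManager per machine, which the thread implies but does not state outright. The heap setting (taskmanager.heap.mb) is left out because no value for it is given here.

```python
def flink_network_config(slots_per_tm, num_tms, factor=4):
    """Return 'key: value' lines for the settings discussed in the thread.

    Assumes parallelism equals the total number of slots in the cluster.
    """
    buffers = slots_per_tm ** 2 * num_tms * factor
    return [
        f"parallelism.default: {slots_per_tm * num_tms}",
        f"taskmanager.network.numberOfBuffers: {buffers}",
    ]


# Four TaskManagers with 96 slots each, as in the original question.
for line in flink_network_config(slots_per_tm=96, num_tms=4):
    print(line)
```

For the cluster in this thread, the output is `parallelism.default: 384` and `taskmanager.network.numberOfBuffers: 147456`, matching the values discussed above.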
Perfect 👍

-- Thank You Regards |