How to choose the 'parallelism.default' value

Punit Naik
Hello

I was running a program with 'parallelism.default' set to 384, as I read in the documentation on Flink's official page that 'parallelism.default' is "the total number of CPUs in the cluster". I have four machines with 96 cores each, so 96*4=384. But the program threw an error saying:

Caused by: java.io.IOException: Insufficient number of network buffers: required 384, but only 298 available. The total number of network buffers is currently set to 2048. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'.

What does this mean? And how do I choose a proper value for parallelism?

--
Thank You

Regards

Punit Naik

Re: How to choose the 'parallelism.default' value

rmetzger0
Hi,

I think you've chosen a good initial value for the parallelism. 
The higher the parallelism, the more network buffers are needed. I would follow the recommendation from the exception and increase the number of network buffers.


Re: How to choose the 'parallelism.default' value

Punit Naik
Yes, I followed it and changed it to 298, but it reported the same error again. The only change was that it now said "required 298, but only 200 available".

Why did it say that?


Re: How to choose the 'parallelism.default' value

Robert Schmidtke
The TMs request the buffers in batches, so 384 were requested, but only 200 were left in the pool. This means your overall pool size is too small. Here is the relevant section from the documentation:



--
My GPG Key ID: 336E2680

Re: How to choose the 'parallelism.default' value

rmetzger0
In reply to this post by Punit Naik
The default value of taskmanager.network.numberOfBuffers is 2048. I would recommend using a multiple of that value, for example 16384 (given that you have enough memory per TaskManager).
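
To get a feel for what "enough memory" means here, the following is a rough sketch of the memory taken by a buffer pool of a given size. It assumes the 32 KB buffer size that is mentioned further down in this thread; the function name is made up for illustration and is not part of Flink.

```python
# Rough estimate of the memory used by the network buffer pool.
# Assumption: each buffer is 32 KB, as stated later in this thread.
def buffer_pool_mib(num_buffers, buffer_kib=32):
    """Return the approximate pool size in MiB (hypothetical helper)."""
    return num_buffers * buffer_kib / 1024

default_pool = buffer_pool_mib(2048)    # pool size for the default setting
larger_pool = buffer_pool_mib(16384)    # pool size for the suggested value

print(f"2048 buffers  -> {default_pool} MiB")
print(f"16384 buffers -> {larger_pool} MiB")
```

Under that assumption the default of 2048 buffers comes to 64 MiB and 16384 buffers to 512 MiB.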

I recommend checking out these slides I created a while ago. They explain what the network buffers are needed for: http://www.slideshare.net/robertmetzger1/apache-flink-hands-on#37



Re: How to choose the 'parallelism.default' value

Aljoscha Krettek
Could it be that the TaskManagers are configured with too little memory?


Re: How to choose the 'parallelism.default' value

Punit Naik

I am afraid not.


Re: How to choose the 'parallelism.default' value

Ufuk Celebi
Hey Punit,

you need to give the task managers more network buffers as Robert
suggested. Using the formula from the docs, can you please use 147456
(96^2*4*4) for the number of network buffers? Each buffer is 32 KB,
meaning that you give 4.5 GB of memory to the network stack. You might
have to adjust the heap memory (taskmanager.heap.mb) you give to the
task managers accordingly.
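
The arithmetic above can be checked with a short sketch. The function names are illustrative only, and the 32 KB buffer size is taken from this message rather than looked up in the Flink configuration.

```python
# Worked version of the numbers in this message:
# buffers = cores_per_machine^2 * machines * 4, each buffer 32 KB.
def network_buffers(cores_per_machine, machines):
    """Number of network buffers according to the formula quoted from the docs."""
    return cores_per_machine ** 2 * machines * 4

def buffer_memory_gib(num_buffers, buffer_kib=32):
    """Memory taken by the buffers in GiB, assuming 32 KB per buffer."""
    return num_buffers * buffer_kib / (1024 * 1024)

buffers = network_buffers(96, 4)
memory = buffer_memory_gib(buffers)

print(buffers)  # 147456
print(memory)   # 4.5
```

This is why the heap setting may need adjusting: the buffers alone account for about 4.5 GB.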

Does this solve it?

– Ufuk



Re: How to choose the 'parallelism.default' value

Punit Naik

Hi Ufuk

Thanks for the detailed answer. I will definitely try this and get back to you.


Re: How to choose the 'parallelism.default' value

Ufuk Celebi
On Mon, May 9, 2016 at 11:05 AM, Punit Naik <[hidden email]> wrote:
> Thanks for the detailed answer. I will definitely try this and get back to
> you.

OK, looking forward to it. ;)

In the meantime I've updated the docs with a more concise version of
what to do when you see this exception.

Re: How to choose the 'parallelism.default' value

Punit Naik

Yeah, thanks a lot for that. Also, if you could, please write the formula, #cores^2 * #machines * 4, in a different form so that it's more readable and understandable.


Re: How to choose the 'parallelism.default' value

Ufuk Celebi
Yes, I did just that and I used the relevant Flink terminology instead
of #cores and #machines:

#cores => #slots per TM
#machines => #TMs
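
With that mapping, the settings discussed in this thread could be derived as in the sketch below. The helper is hypothetical; only the two configuration keys come from the thread, and treating the total slot count as 'parallelism.default' follows the reading of the docs given in the first message.

```python
# Derive the two settings discussed in this thread from Flink terms:
# slots per TaskManager (formerly "#cores") and number of TaskManagers
# (formerly "#machines").
def flink_conf_lines(slots_per_tm, num_tms):
    """Return 'key: value' lines for the two keys named in this thread (hypothetical helper)."""
    settings = {
        # total number of slots in the cluster
        "parallelism.default": slots_per_tm * num_tms,
        # (#slots per TM)^2 * #TMs * 4
        "taskmanager.network.numberOfBuffers": slots_per_tm ** 2 * num_tms * 4,
    }
    return [f"{key}: {value}" for key, value in settings.items()]

lines = flink_conf_lines(96, 4)
print("\n".join(lines))
```

For the cluster in question (96 slots per TM, 4 TMs) this gives a parallelism of 384 and 147456 network buffers, matching the numbers earlier in the thread.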


Re: How to choose the 'parallelism.default' value

Punit Naik
Perfect👍
