Reduce parallelism without network transfer.

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Reduce parallelism without network transfer.

Kien Truong
Hi,

Assuming that I have a streaming job, using 30 task managers with 4 slot each. I want to change the parallelism of 1 operator from 120 to 30. Are there anyway so that each subtask of this operator get data from 4 upstream subtasks running in the same task manager, thus avoiding network completely ?

Best regards,
Kien

Sent from TypeApp
Reply | Threaded
Open this post in threaded view
|

Re: Reduce parallelism without network transfer.

Piotr Nowojski
Hi,

It should work like this out of the box if you use rescale method:


If it will not work, please let us know.

Piotrek

On 3 Feb 2018, at 04:39, Kien Truong <[hidden email]> wrote:

Hi,

Assuming that I have a streaming job, using 30 task managers with 4 slot each. I want to change the parallelism of 1 operator from 120 to 30. Are there anyway so that each subtask of this operator get data from 4 upstream subtasks running in the same task manager, thus avoiding network completely ?

Best regards,
Kien

Sent from TypeApp

Reply | Threaded
Open this post in threaded view
|

Re: Reduce parallelism without network transfer.

Kien Truong
Thanks Piotr, it works.
May I ask why default behavior when reducing parallelism is rebalance, and not rescale ?

Regards,
Kien

Sent from TypeApp
On Feb 5, 2018, at 15:28, Piotr Nowojski <[hidden email]> wrote:
Hi,

It should work like this out of the box if you use rescale method:


If it will not work, please let us know.

Piotrek

On 3 Feb 2018, at 04:39, Kien Truong < [hidden email]> wrote:

Hi,

Assuming that I have a streaming job, using 30 task managers with 4 slot each. I want to change the parallelism of 1 operator from 120 to 30. Are there anyway so that each subtask of this operator get data from 4 upstream subtasks running in the same task manager, thus avoiding network completely ?

Best regards,
Kien

Sent from TypeApp

Reply | Threaded
Open this post in threaded view
|

Re: Reduce parallelism without network transfer.

Piotr Nowojski
Hi,

Rebalance is more safe default setting that protects against data skew. And even the smallest data skew can create a bottleneck much larger then the serialisation/network transfer cost. Especially if one changes the parallelism to a value that’s not a result of multiplication or division (like N down to N-1). And data skew can be arbitrarily large, while rebalance overhead compare to rescale is limited.

Piotrek


On 6 Feb 2018, at 04:32, Kien Truong <[hidden email]> wrote:

Thanks Piotr, it works.
May I ask why default behavior when reducing parallelism is rebalance, and not rescale ?

Regards,
Kien

Sent from TypeApp
On Feb 5, 2018, at 15:28, Piotr Nowojski <[hidden email]> wrote:
Hi,

It should work like this out of the box if you use rescale method:


If it will not work, please let us know.

Piotrek

On 3 Feb 2018, at 04:39, Kien Truong < [hidden email]> wrote:

Hi,

Assuming that I have a streaming job, using 30 task managers with 4 slot each. I want to change the parallelism of 1 operator from 120 to 30. Are there anyway so that each subtask of this operator get data from 4 upstream subtasks running in the same task manager, thus avoiding network completely ?

Best regards,
Kien

Sent from TypeApp