How many task managers can Flink efficiently scale to?

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

How many task managers can Flink efficiently scale to?

Chad Dombrova
Hi,
I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink.  We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard.  Also, what aspects of the system begin to deteriorate at higher scales?

thanks in advance!

-chad

Reply | Threaded
Open this post in threaded view
|

Re: How many task managers can Flink efficiently scale to?

qi luo
Hi Chad,

In our cases, 1~2k TMs with up to ~10k TM slots are used in one job. In general, the CPU/memory of Job Manager should be increased with more TMs.

Regards,
Qi

> On Aug 11, 2019, at 2:03 AM, Chad Dombrova <[hidden email]> wrote:
>
> Hi,
> I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink.  We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard.  Also, what aspects of the system begin to deteriorate at higher scales?
>
> thanks in advance!
>
> -chad
>

Reply | Threaded
Open this post in threaded view
|

Re: How many task managers can Flink efficiently scale to?

Zhu Zhu
Hi Chad,

We have (Blink) jobs each running with over 10 thousands of TMs. 
In our experience, the main regression caused by large scale TMs is the in TM allocation stage in ResourceManager, that some times it fails to allocate enough TMs before the allocation timeout.
It does not deteriorate much once the Flink cluster has reached a stable state.

The main loads, In my mind, increases with the task scale and edge scale of a submitted job.
JM can be overwhelmed by frequent and slow GCs caused by task deployment if the JM memory is not fine tuned.
The JM can also be slower due to more PRCs to JM main thread and increased computation complexity of each RPC handling.

Thanks,
Zhu Zhu

qi luo <[hidden email]> 于2019年8月11日周日 下午6:17写道:
Hi Chad,

In our cases, 1~2k TMs with up to ~10k TM slots are used in one job. In general, the CPU/memory of Job Manager should be increased with more TMs.

Regards,
Qi

> On Aug 11, 2019, at 2:03 AM, Chad Dombrova <[hidden email]> wrote:
>
> Hi,
> I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink.  We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard.  Also, what aspects of the system begin to deteriorate at higher scales?
>
> thanks in advance!
>
> -chad
>

Reply | Threaded
Open this post in threaded view
|

Re: How many task managers can Flink efficiently scale to?

Chad Dombrova
Thanks for the info!  It's very helpful.

-chad


On Sun, Aug 11, 2019 at 4:21 AM Zhu Zhu <[hidden email]> wrote:
Hi Chad,

We have (Blink) jobs each running with over 10 thousands of TMs. 
In our experience, the main regression caused by large scale TMs is the in TM allocation stage in ResourceManager, that some times it fails to allocate enough TMs before the allocation timeout.
It does not deteriorate much once the Flink cluster has reached a stable state.

The main loads, In my mind, increases with the task scale and edge scale of a submitted job.
JM can be overwhelmed by frequent and slow GCs caused by task deployment if the JM memory is not fine tuned.
The JM can also be slower due to more PRCs to JM main thread and increased computation complexity of each RPC handling.

Thanks,
Zhu Zhu

qi luo <[hidden email]> 于2019年8月11日周日 下午6:17写道:
Hi Chad,

In our cases, 1~2k TMs with up to ~10k TM slots are used in one job. In general, the CPU/memory of Job Manager should be increased with more TMs.

Regards,
Qi

> On Aug 11, 2019, at 2:03 AM, Chad Dombrova <[hidden email]> wrote:
>
> Hi,
> I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink.  We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard.  Also, what aspects of the system begin to deteriorate at higher scales?
>
> thanks in advance!
>
> -chad
>