(DEPRECATED) Apache Flink User Mailing List archive.

How many task managers can Flink efficiently scale to?

Chad Dombrova

How many task managers can Flink efficiently scale to?

Hi,

I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink. We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard. Also, what aspects of the system begin to deteriorate at higher scales?

thanks in advance!

-chad

qi luo

Re: How many task managers can Flink efficiently scale to?

Hi Chad,

In our case, 1k to 2k TMs with up to ~10k TM slots are used in one job. In general, the CPU and memory of the JobManager should be increased as the number of TMs grows.

Regards,
Qi

> On Aug 11, 2019, at 2:03 AM, Chad Dombrova <[hidden email]> wrote:
>
> Hi,
> I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink. We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard. Also, what aspects of the system begin to deteriorate at higher scales?
>
> thanks in advance!
>
> -chad
>

Zhu Zhu

Re: How many task managers can Flink efficiently scale to?

Hi Chad,

We have (Blink) jobs that each run with over ten thousand TMs.

In our experience, the main regression caused by a large number of TMs is in the TM allocation stage in the ResourceManager, which sometimes fails to allocate enough TMs before the allocation timeout.

It does not deteriorate much once the Flink cluster has reached a stable state.

The main load, in my mind, increases with the task scale and edge scale of a submitted job.

The JM can be overwhelmed by frequent and slow GCs caused by task deployment if the JM memory is not fine-tuned.

The JM can also become slower due to more RPCs to the JM main thread and the increased computational complexity of handling each RPC.

Thanks,

Zhu Zhu
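
To make "task scale and edge scale" concrete, here is a minimal Python sketch of how the two quantities might be estimated for a simple job. It is a simplified model, not Flink code: the function names, the example job, and the parallelism values are all hypothetical, and it assumes that an all-to-all connection (for example after a keyBy) links every upstream subtask to every downstream subtask, while a pointwise connection is approximated as one link per subtask on the larger side.

```python
# Rough estimate of how many tasks and edges a job asks the JM to track.
# All names and the example job below are hypothetical, not from the thread.

def edge_count(upstream: int, downstream: int, all_to_all: bool) -> int:
    """Estimate the connections between the subtasks of two operators."""
    if all_to_all:
        # Assumption: every upstream subtask connects to every
        # downstream subtask (e.g. after a keyBy or rebalance).
        return upstream * downstream
    # Pointwise (e.g. forward): simplified to one connection per
    # subtask on the larger side.
    return max(upstream, downstream)


def job_scale(parallelisms, all_to_all_flags):
    """Return (total tasks, total edges) for a linear chain of operators."""
    tasks = sum(parallelisms)
    edges = sum(
        edge_count(up, down, flag)
        for up, down, flag in zip(parallelisms, parallelisms[1:], all_to_all_flags)
    )
    return tasks, edges


# Hypothetical job: source -> map (forward) -> keyed aggregation (all-to-all)
small = job_scale([100, 100, 100], [False, True])
large = job_scale([1000, 1000, 1000], [False, True])
print(small)  # (300, 10100)
print(large)  # (3000, 1001000)
```

Under these assumptions, raising the parallelism tenfold raises the task count tenfold but the edge count roughly a hundredfold, which is one way to read the point that JM load depends on the shape of the job and not only on the number of TMs.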

qi luo <[hidden email]> wrote on Sun, Aug 11, 2019 at 6:17 PM:

Hi Chad,

In our case, 1k to 2k TMs with up to ~10k TM slots are used in one job. In general, the CPU and memory of the JobManager should be increased as the number of TMs grows.

Regards,
Qi

Chad Dombrova

Re: How many task managers can Flink efficiently scale to?

Thanks for the info! It's very helpful.

-chad

On Sun, Aug 11, 2019 at 4:21 AM Zhu Zhu <[hidden email]> wrote:

Hi Chad,

We have (Blink) jobs that each run with over ten thousand TMs.
In our experience, the main regression caused by a large number of TMs is in the TM allocation stage in the ResourceManager, which sometimes fails to allocate enough TMs before the allocation timeout.
It does not deteriorate much once the Flink cluster has reached a stable state.

The main load, in my mind, increases with the task scale and edge scale of a submitted job.
The JM can be overwhelmed by frequent and slow GCs caused by task deployment if the JM memory is not fine-tuned.
The JM can also become slower due to more RPCs to the JM main thread and the increased computational complexity of handling each RPC.

Thanks,
Zhu Zhu
