Hi,
I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink. We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard. Also, what aspects of the system begin to deteriorate at higher scales? thanks in advance! -chad |
Hi Chad,
In our cases, 1~2k TMs with up to ~10k TM slots are used in one job. In general, the CPU/memory of Job Manager should be increased with more TMs. Regards, Qi > On Aug 11, 2019, at 2:03 AM, Chad Dombrova <[hidden email]> wrote: > > Hi, > I'm still on my task management investigation, and I'm curious to know how many task managers people are reliably using with Flink. We're currently using AWS | Thinkbox Deadline, and we're able to easily utilize over 300 workers, and I've heard from other customers who use several thousand, so I'm curious how Flink compares in this regard. Also, what aspects of the system begin to deteriorate at higher scales? > > thanks in advance! > > -chad > |
Hi Chad, We have (Blink) jobs each running with over 10 thousands of TMs. In our experience, the main regression caused by large scale TMs is the in TM allocation stage in ResourceManager, that some times it fails to allocate enough TMs before the allocation timeout. It does not deteriorate much once the Flink cluster has reached a stable state. The main loads, In my mind, increases with the task scale and edge scale of a submitted job. JM can be overwhelmed by frequent and slow GCs caused by task deployment if the JM memory is not fine tuned. The JM can also be slower due to more PRCs to JM main thread and increased computation complexity of each RPC handling. Thanks, Zhu Zhu qi luo <[hidden email]> 于2019年8月11日周日 下午6:17写道: Hi Chad, |
Thanks for the info! It's very helpful. -chad On Sun, Aug 11, 2019 at 4:21 AM Zhu Zhu <[hidden email]> wrote:
|
Free forum by Nabble | Edit this page |