Scaling Higher than 10k Nodes

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Scaling Higher than 10k Nodes

Joey Tran
Hi, I was looking at Apache Beam/Flink for some of our data processing needs, but when reading about the resource managers (YARN/mesos/Kubernetes), it seems like they all top out at around 10k nodes. What are recommended solutions for scaling higher than this?

Thanks in advance,
Joey
Reply | Threaded
Open this post in threaded view
|

Re: Scaling Higher than 10k Nodes

Piotr Nowojski-4
Hi Joey,

Sorry for not responding to your question sooner. As you can imagine there are not many users running Flink at such scale. As far as I know, Alibaba is running the largest/one of the largest clusters, I'm asking for someone who is familiar with those deployments to take a look at this conversation. I hope someone will respond here soon :)

Best,
Piotrek

pon., 1 mar 2021 o 14:43 Joey Tran <[hidden email]> napisał(a):
Hi, I was looking at Apache Beam/Flink for some of our data processing needs, but when reading about the resource managers (YARN/mesos/Kubernetes), it seems like they all top out at around 10k nodes. What are recommended solutions for scaling higher than this?

Thanks in advance,
Joey
Reply | Threaded
Open this post in threaded view
|

Re: Scaling Higher than 10k Nodes

Xintong Song
Hi Joey,

Quick question: by *nodes*, do you mean Flink task manager processes, or physical/virtual machines (like ecs, yarn NM)? 

In our production, we run flink workloads on several Yarn/Kubernetes clusters, where each cluster typically has 2k~5k machines. Most Flink workloads are deployed in single-job mode, giving us thousands (sometimes more than 10k) of flink instances concurrently running on each cluster. In this way, the scale of each flink instance is usually not extremely large (less than 1000 TMs), and we rely on the power of Yarn/Kubernetes to deal with the large number of instances.

There're also cases that a single flink job is extremely large. We had a batch workload from last year's double-11 event, with 8k max per-stage parallelism and up to 30k task managers running at the same time. At that scale, we run into problems with the single-point JobManager process, such as tremendous memory consumption, buzy rpc main thread, etc. To make that case work, we did many optimizations on our internal flink version, which we are trying to contribute to the community. See FLINK-21110 [1] for the details.

Thank you~

Xintong Song



On Thu, Mar 4, 2021 at 3:39 PM Piotr Nowojski <[hidden email]> wrote:
Hi Joey,

Sorry for not responding to your question sooner. As you can imagine there are not many users running Flink at such scale. As far as I know, Alibaba is running the largest/one of the largest clusters, I'm asking for someone who is familiar with those deployments to take a look at this conversation. I hope someone will respond here soon :)

Best,
Piotrek

pon., 1 mar 2021 o 14:43 Joey Tran <[hidden email]> napisał(a):
Hi, I was looking at Apache Beam/Flink for some of our data processing needs, but when reading about the resource managers (YARN/mesos/Kubernetes), it seems like they all top out at around 10k nodes. What are recommended solutions for scaling higher than this?

Thanks in advance,
Joey
Reply | Threaded
Open this post in threaded view
|

Re: Scaling Higher than 10k Nodes

Piotr Nowojski-4
Maybe a stupid question Joey, but if the problem is in the resource managers, haven't you tried running standalone Flink clusters without any resource manager? Probably you would still hit the JobManager problems that Xintong mentioned, but those problems we can help addressing.

Piotrek

czw., 4 mar 2021 o 09:30 Xintong Song <[hidden email]> napisał(a):
Hi Joey,

Quick question: by *nodes*, do you mean Flink task manager processes, or physical/virtual machines (like ecs, yarn NM)? 

In our production, we run flink workloads on several Yarn/Kubernetes clusters, where each cluster typically has 2k~5k machines. Most Flink workloads are deployed in single-job mode, giving us thousands (sometimes more than 10k) of flink instances concurrently running on each cluster. In this way, the scale of each flink instance is usually not extremely large (less than 1000 TMs), and we rely on the power of Yarn/Kubernetes to deal with the large number of instances.

There're also cases that a single flink job is extremely large. We had a batch workload from last year's double-11 event, with 8k max per-stage parallelism and up to 30k task managers running at the same time. At that scale, we run into problems with the single-point JobManager process, such as tremendous memory consumption, buzy rpc main thread, etc. To make that case work, we did many optimizations on our internal flink version, which we are trying to contribute to the community. See FLINK-21110 [1] for the details.

Thank you~

Xintong Song



On Thu, Mar 4, 2021 at 3:39 PM Piotr Nowojski <[hidden email]> wrote:
Hi Joey,

Sorry for not responding to your question sooner. As you can imagine there are not many users running Flink at such scale. As far as I know, Alibaba is running the largest/one of the largest clusters, I'm asking for someone who is familiar with those deployments to take a look at this conversation. I hope someone will respond here soon :)

Best,
Piotrek

pon., 1 mar 2021 o 14:43 Joey Tran <[hidden email]> napisał(a):
Hi, I was looking at Apache Beam/Flink for some of our data processing needs, but when reading about the resource managers (YARN/mesos/Kubernetes), it seems like they all top out at around 10k nodes. What are recommended solutions for scaling higher than this?

Thanks in advance,
Joey
Reply | Threaded
Open this post in threaded view
|

Re: Scaling Higher than 10k Nodes

Yuval Itzchakov
Hi Joey,

We are currently running around 2000+ small Flink clusters on top of k8s, currently at around ~ 100 nodes. Do you see yourself scaling to 10k nodes, given that each node can run a significant amount of Flink jobs inside of it?

On Thu, Mar 4, 2021 at 10:51 AM Piotr Nowojski <[hidden email]> wrote:
Maybe a stupid question Joey, but if the problem is in the resource managers, haven't you tried running standalone Flink clusters without any resource manager? Probably you would still hit the JobManager problems that Xintong mentioned, but those problems we can help addressing.

Piotrek

czw., 4 mar 2021 o 09:30 Xintong Song <[hidden email]> napisał(a):
Hi Joey,

Quick question: by *nodes*, do you mean Flink task manager processes, or physical/virtual machines (like ecs, yarn NM)? 

In our production, we run flink workloads on several Yarn/Kubernetes clusters, where each cluster typically has 2k~5k machines. Most Flink workloads are deployed in single-job mode, giving us thousands (sometimes more than 10k) of flink instances concurrently running on each cluster. In this way, the scale of each flink instance is usually not extremely large (less than 1000 TMs), and we rely on the power of Yarn/Kubernetes to deal with the large number of instances.

There're also cases that a single flink job is extremely large. We had a batch workload from last year's double-11 event, with 8k max per-stage parallelism and up to 30k task managers running at the same time. At that scale, we run into problems with the single-point JobManager process, such as tremendous memory consumption, buzy rpc main thread, etc. To make that case work, we did many optimizations on our internal flink version, which we are trying to contribute to the community. See FLINK-21110 [1] for the details.

Thank you~

Xintong Song



On Thu, Mar 4, 2021 at 3:39 PM Piotr Nowojski <[hidden email]> wrote:
Hi Joey,

Sorry for not responding to your question sooner. As you can imagine there are not many users running Flink at such scale. As far as I know, Alibaba is running the largest/one of the largest clusters, I'm asking for someone who is familiar with those deployments to take a look at this conversation. I hope someone will respond here soon :)

Best,
Piotrek

pon., 1 mar 2021 o 14:43 Joey Tran <[hidden email]> napisał(a):
Hi, I was looking at Apache Beam/Flink for some of our data processing needs, but when reading about the resource managers (YARN/mesos/Kubernetes), it seems like they all top out at around 10k nodes. What are recommended solutions for scaling higher than this?

Thanks in advance,
Joey


--
Best Regards,
Yuval Itzchakov.