(DEPRECATED) Apache Flink User Mailing List archive.

scaling issue Running Flink on Kubernetes

Classic

List

Threaded

7 messages Options

Eleanore Jin

scaling issue Running Flink on Kubernetes

Hi Experts,

I have my flink application running on Kubernetes, initially with 1 Job Manager, and 2 Task Managers.

Then we have the custom operator that watches for the CRD, when the CRD replicas changed, it will patch the Flink Job Manager deployment parallelism and max parallelism according to the replicas from CRD (parallelism can be configured via env variables for our application). which causes the job manager restart. hence a new Flink job. But the consumer group does not change, so it will continue from the offset where it left.

In addition, operator will also update Task Manager's deployment replicas, and will adjust the pod number.

In case of scale up, the existing task manager pods do not get killed, but new task manager pods will be created.

And we observed a skew in the partition offset consumed. e.g. some partitions have huge lags and other partitions have small lags. (observed from burrow)

This is also validated by the metrics from Flink UI, showing the throughput differs for slotss

Any clue why this is the case?

Thanks a lot!

Eleanore

Xintong Song

Re: scaling issue Running Flink on Kubernetes

Hi Eleanore,

I have a few more questions regarding your issue.

Which Flink version are you using?
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
Keeping the job running a while after the scale-up, does the skew ease?

I suspect the performance difference might be an outcome of some warming up issues. E.g., the existing TMs might have some file already localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not.

Thank you~

Xintong Song

On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <[hidden email]> wrote:

Hi Experts,
I have my flink application running on Kubernetes, initially with 1 Job Manager, and 2 Task Managers.

Then we have the custom operator that watches for the CRD, when the CRD replicas changed, it will patch the Flink Job Manager deployment parallelism and max parallelism according to the replicas from CRD (parallelism can be configured via env variables for our application). which causes the job manager restart. hence a new Flink job. But the consumer group does not change, so it will continue from the offset where it left.

In addition, operator will also update Task Manager's deployment replicas, and will adjust the pod number.

In case of scale up, the existing task manager pods do not get killed, but new task manager pods will be created.

And we observed a skew in the partition offset consumed. e.g. some partitions have huge lags and other partitions have small lags. (observed from burrow)

This is also validated by the metrics from Flink UI, showing the throughput differs for slotss

Any clue why this is the case?

Thanks a lot!
Eleanore

Eleanore Jin

Re: scaling issue Running Flink on Kubernetes

_Hi Xintong,

Thanks for the prompt reply! To answer your question:

Which Flink version are you using?

v1.8.2

Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?

I also tried this, it seems skew also happens even I do not change the parallelism, so it may not caused by scale-up/down

Keeping the job running a while after the scale-up, does the skew ease?

So the skew happens in such a way that: some partitions lags down to 0, but other partitions are still at level of 10_000, and I am seeing the back pressure is ok.

Thanks a lot!

Eleanore

On Tue, Mar 10, 2020 at 7:03 PM Xintong Song <[hidden email]> wrote:

Hi Eleanore,

I have a few more questions regarding your issue.
Which Flink version are you using?
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
Keeping the job running a while after the scale-up, does the skew ease?
I suspect the performance difference might be an outcome of some warming up issues. E.g., the existing TMs might have some file already localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <[hidden email]> wrote:
Hi Experts,
I have my flink application running on Kubernetes, initially with 1 Job Manager, and 2 Task Managers.

Then we have the custom operator that watches for the CRD, when the CRD replicas changed, it will patch the Flink Job Manager deployment parallelism and max parallelism according to the replicas from CRD (parallelism can be configured via env variables for our application). which causes the job manager restart. hence a new Flink job. But the consumer group does not change, so it will continue from the offset where it left.

In addition, operator will also update Task Manager's deployment replicas, and will adjust the pod number.

In case of scale up, the existing task manager pods do not get killed, but new task manager pods will be created.

And we observed a skew in the partition offset consumed. e.g. some partitions have huge lags and other partitions have small lags. (observed from burrow)

This is also validated by the metrics from Flink UI, showing the throughput differs for slotss

Any clue why this is the case?

Thanks a lot!
Eleanore

Xintong Song

Re: scaling issue Running Flink on Kubernetes

Hi Eleanore,

That does't sound like a scaling issue. It's probably a data skew, that the data volume on some of the keys are significantly higher than others. I'm not familiar with this area though, and have copied Jark for you, who is one of the community experts in this area.

Thank you~

Xintong Song

On Wed, Mar 11, 2020 at 10:37 AM Eleanore Jin <[hidden email]> wrote:

_Hi Xintong,

Thanks for the prompt reply! To answer your question:
Which Flink version are you using?
v1.8.2
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
I also tried this, it seems skew also happens even I do not change the parallelism, so it may not caused by scale-up/down
Keeping the job running a while after the scale-up, does the skew ease?
So the skew happens in such a way that: some partitions lags down to 0, but other partitions are still at level of 10_000, and I am seeing the back pressure is ok.

Thanks a lot!
Eleanore

On Tue, Mar 10, 2020 at 7:03 PM Xintong Song <[hidden email]> wrote:
Hi Eleanore,

I have a few more questions regarding your issue.
Which Flink version are you using?
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
Keeping the job running a while after the scale-up, does the skew ease?
I suspect the performance difference might be an outcome of some warming up issues. E.g., the existing TMs might have some file already localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <[hidden email]> wrote:
Hi Experts,
I have my flink application running on Kubernetes, initially with 1 Job Manager, and 2 Task Managers.

Then we have the custom operator that watches for the CRD, when the CRD replicas changed, it will patch the Flink Job Manager deployment parallelism and max parallelism according to the replicas from CRD (parallelism can be configured via env variables for our application). which causes the job manager restart. hence a new Flink job. But the consumer group does not change, so it will continue from the offset where it left.

In addition, operator will also update Task Manager's deployment replicas, and will adjust the pod number.

In case of scale up, the existing task manager pods do not get killed, but new task manager pods will be created.

And we observed a skew in the partition offset consumed. e.g. some partitions have huge lags and other partitions have small lags. (observed from burrow)

This is also validated by the metrics from Flink UI, showing the throughput differs for slotss

Any clue why this is the case?

Thanks a lot!
Eleanore

Flavio Pompermaier

Re: scaling issue Running Flink on Kubernetes

Have you tried to use existing operators such as https://github.com/GoogleCloudPlatform/flink-on-k8s-operator or https://github.com/GoogleCloudPlatform/flink-on-k8s-operator?

On Wed, Mar 11, 2020 at 4:46 AM Xintong Song <[hidden email]> wrote:

Hi Eleanore,

That does't sound like a scaling issue. It's probably a data skew, that the data volume on some of the keys are significantly higher than others. I'm not familiar with this area though, and have copied Jark for you, who is one of the community experts in this area.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 10:37 AM Eleanore Jin <[hidden email]> wrote:
_Hi Xintong,

Thanks for the prompt reply! To answer your question:
Which Flink version are you using?
v1.8.2
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
I also tried this, it seems skew also happens even I do not change the parallelism, so it may not caused by scale-up/down
Keeping the job running a while after the scale-up, does the skew ease?
So the skew happens in such a way that: some partitions lags down to 0, but other partitions are still at level of 10_000, and I am seeing the back pressure is ok.

Thanks a lot!
Eleanore

On Tue, Mar 10, 2020 at 7:03 PM Xintong Song <[hidden email]> wrote:
Hi Eleanore,

I have a few more questions regarding your issue.
Which Flink version are you using?
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
Keeping the job running a while after the scale-up, does the skew ease?
I suspect the performance difference might be an outcome of some warming up issues. E.g., the existing TMs might have some file already localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <[hidden email]> wrote:
Hi Experts,
I have my flink application running on Kubernetes, initially with 1 Job Manager, and 2 Task Managers.

Then we have the custom operator that watches for the CRD, when the CRD replicas changed, it will patch the Flink Job Manager deployment parallelism and max parallelism according to the replicas from CRD (parallelism can be configured via env variables for our application). which causes the job manager restart. hence a new Flink job. But the consumer group does not change, so it will continue from the offset where it left.

In addition, operator will also update Task Manager's deployment replicas, and will adjust the pod number.

In case of scale up, the existing task manager pods do not get killed, but new task manager pods will be created.

And we observed a skew in the partition offset consumed. e.g. some partitions have huge lags and other partitions have small lags. (observed from burrow)

This is also validated by the metrics from Flink UI, showing the throughput differs for slotss

Any clue why this is the case?

Thanks a lot!
Eleanore

Flavio Pompermaier

Re: scaling issue Running Flink on Kubernetes

Sorry I wanted to mention https://github.com/lyft/flinkk8soperator (I don't know which one of the 2 is better)

On Wed, Mar 11, 2020 at 10:19 AM Flavio Pompermaier <[hidden email]> wrote:

Have you tried to use existing operators such as https://github.com/GoogleCloudPlatform/flink-on-k8s-operator or https://github.com/GoogleCloudPlatform/flink-on-k8s-operator?

On Wed, Mar 11, 2020 at 4:46 AM Xintong Song <[hidden email]> wrote:
Hi Eleanore,

That does't sound like a scaling issue. It's probably a data skew, that the data volume on some of the keys are significantly higher than others. I'm not familiar with this area though, and have copied Jark for you, who is one of the community experts in this area.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 10:37 AM Eleanore Jin <[hidden email]> wrote:
_Hi Xintong,

Thanks for the prompt reply! To answer your question:
Which Flink version are you using?
v1.8.2
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
I also tried this, it seems skew also happens even I do not change the parallelism, so it may not caused by scale-up/down
Keeping the job running a while after the scale-up, does the skew ease?
So the skew happens in such a way that: some partitions lags down to 0, but other partitions are still at level of 10_000, and I am seeing the back pressure is ok.

Thanks a lot!
Eleanore

On Tue, Mar 10, 2020 at 7:03 PM Xintong Song <[hidden email]> wrote:
Hi Eleanore,

I have a few more questions regarding your issue.
Which Flink version are you using?
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
Keeping the job running a while after the scale-up, does the skew ease?
I suspect the performance difference might be an outcome of some warming up issues. E.g., the existing TMs might have some file already localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <[hidden email]> wrote:
Hi Experts,
I have my flink application running on Kubernetes, initially with 1 Job Manager, and 2 Task Managers.

Then we have the custom operator that watches for the CRD, when the CRD replicas changed, it will patch the Flink Job Manager deployment parallelism and max parallelism according to the replicas from CRD (parallelism can be configured via env variables for our application). which causes the job manager restart. hence a new Flink job. But the consumer group does not change, so it will continue from the offset where it left.

In addition, operator will also update Task Manager's deployment replicas, and will adjust the pod number.

In case of scale up, the existing task manager pods do not get killed, but new task manager pods will be created.

And we observed a skew in the partition offset consumed. e.g. some partitions have huge lags and other partitions have small lags. (observed from burrow)

This is also validated by the metrics from Flink UI, showing the throughput differs for slotss

Any clue why this is the case?

Thanks a lot!
Eleanore

Eleanore Jin

Re: scaling issue Running Flink on Kubernetes

Hi Flavio,

We have implemented our own flink operator, the operator will start a flink job cluster (the application jar is already packaged together with flink in the docker image). I believe Google's flink operator will start a session cluster, and user can submit the flink job via REST. Not looked into lyft one before.

Eleanore

On Wed, Mar 11, 2020 at 2:21 AM Flavio Pompermaier <[hidden email]> wrote:

Sorry I wanted to mention https://github.com/lyft/flinkk8soperator (I don't know which one of the 2 is better)

On Wed, Mar 11, 2020 at 10:19 AM Flavio Pompermaier <[hidden email]> wrote:
Have you tried to use existing operators such as https://github.com/GoogleCloudPlatform/flink-on-k8s-operator or https://github.com/GoogleCloudPlatform/flink-on-k8s-operator?

On Wed, Mar 11, 2020 at 4:46 AM Xintong Song <[hidden email]> wrote:
Hi Eleanore,

That does't sound like a scaling issue. It's probably a data skew, that the data volume on some of the keys are significantly higher than others. I'm not familiar with this area though, and have copied Jark for you, who is one of the community experts in this area.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 10:37 AM Eleanore Jin <[hidden email]> wrote:
_Hi Xintong,

Thanks for the prompt reply! To answer your question:
Which Flink version are you using?
v1.8.2
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
I also tried this, it seems skew also happens even I do not change the parallelism, so it may not caused by scale-up/down
Keeping the job running a while after the scale-up, does the skew ease?
So the skew happens in such a way that: some partitions lags down to 0, but other partitions are still at level of 10_000, and I am seeing the back pressure is ok.

Thanks a lot!
Eleanore

On Tue, Mar 10, 2020 at 7:03 PM Xintong Song <[hidden email]> wrote:
Hi Eleanore,

I have a few more questions regarding your issue.
Which Flink version are you using?
Is this skew observed only after a scaling-up? What happens if the parallelism is initially set to the scaled-up value?
Keeping the job running a while after the scale-up, does the skew ease?
I suspect the performance difference might be an outcome of some warming up issues. E.g., the existing TMs might have some file already localized, or some memory buffers already promoted to the JVM tenured area, while the new TMs have not.

Thank you~
Xintong Song

On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin <[hidden email]> wrote:
Hi Experts,
I have my flink application running on Kubernetes, initially with 1 Job Manager, and 2 Task Managers.

Then we have the custom operator that watches for the CRD, when the CRD replicas changed, it will patch the Flink Job Manager deployment parallelism and max parallelism according to the replicas from CRD (parallelism can be configured via env variables for our application). which causes the job manager restart. hence a new Flink job. But the consumer group does not change, so it will continue from the offset where it left.

In addition, operator will also update Task Manager's deployment replicas, and will adjust the pod number.

In case of scale up, the existing task manager pods do not get killed, but new task manager pods will be created.

And we observed a skew in the partition offset consumed. e.g. some partitions have huge lags and other partitions have small lags. (observed from burrow)

This is also validated by the metrics from Flink UI, showing the throughput differs for slotss

Any clue why this is the case?

Thanks a lot!
Eleanore