(DEPRECATED) Apache Flink User Mailing List archive.

Flink CPU load metrics in K8s

Bajaj, Abhinav

Flink CPU load metrics in K8s

Hi,

I am trying to understand the CPU Load metrics reported by Flink 1.7.1 running with openjdk 1.8.0_212 on K8s.

After deploying the Flink Job on K8s, I tried to get CPU Load metrics following this documentation.

curl 'localhost:8081/taskmanagers/7737ac33b311ea0a696422680711597b/metrics?get=Status.JVM.CPU.Load,Status.JVM.CPU.Time'

[{"id":"Status.JVM.CPU.Load","value":"0.0023815194093831865"},{"id":"Status.JVM.CPU.Time","value":"23260000000"}]

The value of the CPU load looks odd to me.

What is the unit and scale of this value?

How does Flink determine this value?

Appreciate your time and help here.

~ Abhinav Bajaj

Roman Grebennikov

Re: Flink CPU load metrics in K8s

Hi,

JVM.CPU.Load is just a wrapper (MetricUtils.instantiateCPUMetrics) on top of OperatingSystemMXBean.getProcessCpuLoad (see https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad())

It usually looks odd on machines with multiple CPU cores. For example, if a job with a single slot fully utilizes one core on an 8-core machine, JVM.CPU.Load will be 1.0/8.0 = 0.125. It is also a point-in-time snapshot of current CPU usage: if you collect metrics once a minute and the workload is spiky within that minute (say, idle almost all the time but consuming 100% CPU for one second once a minute), you may miss the spike in the metrics entirely.
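To make the arithmetic above concrete, here is a minimal sketch. The class and helper names are made up for illustration; the only real API used is the JDK's com.sun.management.OperatingSystemMXBean.getProcessCpuLoad(), which is the call the metric is described as wrapping.

```java
import java.lang.management.ManagementFactory;

final class ProcessCpuLoadSketch {

    // Load reported when `busyCores` cores are fully used out of `totalCores`.
    // Hypothetical helper, only here to illustrate the normalization.
    static double normalizedLoad(double busyCores, int totalCores) {
        return busyCores / totalCores;
    }

    public static void main(String[] args) {
        // The example from the message: one saturated core on an 8-core machine.
        System.out.println(normalizedLoad(1.0, 8)); // prints 0.125

        // getProcessCpuLoad() lives on the com.sun.management extension
        // interface, so the platform bean has to be checked and cast first.
        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            double load =
                    ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
            // A value in [0.0, 1.0], or negative if the JVM cannot determine it.
            System.out.println("process CPU load: " + load);
        }
    }
}
```

Running this inside the taskmanager image should show roughly the same kind of number as Status.JVM.CPU.Load, since both read the same bean.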

Personally, I find JVM.CPU.Time a clearer indicator of CPU usage: it is an always-increasing count of the nanoseconds of CPU time spent executing your code, so it also captures usage spikes.
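As a rough sketch of how a cumulative CPU-time counter can be turned into a usage figure: take two samples and divide the difference by the wall-clock time between them. This assumes the counter is in nanoseconds, as the JDK documents for getProcessCpuTime(). The second sample and the one-minute interval are invented for illustration.

```java
final class CpuTimeRateSketch {

    // Average number of cores kept busy between two samples of a cumulative
    // CPU-time counter (nanoseconds) taken `wallNanos` of wall-clock time apart.
    static double averageBusyCores(long cpuNanosStart, long cpuNanosEnd, long wallNanos) {
        return (double) (cpuNanosEnd - cpuNanosStart) / wallNanos;
    }

    public static void main(String[] args) {
        long first = 23_260_000_000L;          // Status.JVM.CPU.Time from the REST response
        long second = first + 1_000_000_000L;  // hypothetical sample taken one minute later
        long wall = 60_000_000_000L;           // 60 s between the two samples

        // One second of CPU burned at any point in the minute still shows up
        // in the difference, as 1/60 of a core on average.
        System.out.println(averageBusyCores(first, second, wall));
    }
}
```

Because the counter only ever grows, a short burst between two scrapes cannot be missed the way it can with an instantaneous load reading.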

Roman Grebennikov | [hidden email]

Bajaj, Abhinav

Re: Flink CPU load metrics in K8s

Thanks Roman for providing the details.

I also made more observations that have increased my confusion about this topic 😝

To ease the calculations, I deployed a test cluster, this time providing 1 CPU in K8s (with Docker) to each taskmanager container.

When I check the taskmanager CPU load, the value is on the order of "0.002158428663932657".

Assuming that the underlying JVM recognizes the 1 CPU allocated to the Docker container, this value means CPU usage in the ballpark of 0.21%.

However, the K8s metrics for this container (formula below) put it in the ballpark of 10-16%.

There is no other process running in the container apart from the Flink taskmanager.

The two CPU usage values differ by orders of magnitude.

Am I comparing the right metrics here?

How are folks running Flink on K8s monitoring the CPU load?

~ Abhi

% CPU usage from K8s metrics

sum(rate(container_cpu_usage_seconds_total{pod=~"my-taskmanagers-.*", container="taskmanager"}[5m])) by (pod)
/ sum(container_spec_cpu_quota{pod=~"my-taskmanager-pod-.*", container="taskmanager"}
/ container_spec_cpu_period{pod=~"my-taskmanager-pod-.*", container="taskmanager"}) by (pod)
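The formula above boils down to "CPU seconds consumed per second, divided by the number of CPUs the container is allowed". A minimal sketch of that arithmetic follows. The input numbers are hypothetical, and it assumes quota and period are both expressed in the same unit (cAdvisor reports them in microseconds, as far as I know).

```java
final class ContainerCpuUsageSketch {

    // Fraction of the CPU limit in use: consumption rate (CPU seconds per
    // second) divided by the CPUs allowed (CFS quota / CFS period).
    static double usageOfLimit(double cpuSecondsPerSecond, long quotaMicros, long periodMicros) {
        double cpuLimit = (double) quotaMicros / periodMicros;
        return cpuSecondsPerSecond / cpuLimit;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: a 1-CPU limit (quota == period) and a measured
        // rate of 0.13 CPU seconds per second.
        System.out.println(usageOfLimit(0.13, 100_000, 100_000)); // prints 0.13
    }
}
```

With a 1-CPU limit the divisor is 1, so the ratio is simply the consumption rate. That is why a 10-16% reading here means 0.10-0.16 of one core.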

Bajaj, Abhinav

Re: Flink CPU load metrics in K8s

Hi,

Reaching out to folks running Flink on K8s.

~ Abhinav Bajaj

Xintong Song

Re: Flink CPU load metrics in K8s

Hi Abhinav,

Do you know how many CPUs in total the physical machine running the Kubernetes container has?

I'm asking because I doubt that the JVM is aware that only 1 CPU is configured for the container. The JVM does not learn how many CPUs are configured and then hold itself to that number. Instead, it tries to use as much CPU time as it can, and the limit is enforced externally (OS, Docker, cgroups, ...).

Please understand that Docker containers are not virtual machines. They do not "pretend" to have only certain hardware. I did a simple test on my laptop, launching a Docker container with a CPU limit configured. Inside the container, I can still see all my machine's CPUs.

Thank you~

Xintong Song

Bajaj, Abhinav

Re: Flink CPU load metrics in K8s

Thanks Xintong for your input.

From what I could find, the JDK version we use, 1.8.0_212, includes Docker/container support.

I also ran a quick test inside the Docker image using the following:

Runtime.getRuntime().availableProcessors()

It showed the right number of CPU cores assigned to the container.

But I am not familiar with OperatingSystemMXBean used by Flink.

So I don’t know if it will pick up docker CPU limits set by K8s or not. I will continue to investigate that.
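One way to investigate this is to print both views side by side from inside the container. The following is only a sketch with made-up class and helper names. It uses Runtime.availableProcessors() and the two OperatingSystemMXBean interfaces from the JDK, and makes no claim about what either reports on a particular JDK build.

```java
import java.lang.management.ManagementFactory;

final class ContainerCpuViewSketch {

    // CPU count the JVM believes it may use.
    static int runtimeCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Process CPU load from the com.sun.management bean, or -1.0 if that
    // extension interface is not available on this JVM.
    static double processCpuLoad() {
        java.lang.management.OperatingSystemMXBean os =
                ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad();
        }
        return -1.0;
    }

    public static void main(String[] args) {
        System.out.println("Runtime.availableProcessors:   " + runtimeCpus());
        System.out.println("MXBean.getAvailableProcessors: "
                + ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors());
        System.out.println("MXBean.getProcessCpuLoad:      " + processCpuLoad());
    }
}
```

If the processor counts match the container limit while the load stays far below what the container metrics show, that would suggest the load figure is still being computed against the host's CPUs.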

In the meantime, the K8s metric container_cpu_usage_seconds_total does seem to provide the expected CPU usage for the containers.

I was hoping that someone in the community may have already run into this behavior on K8s and can share their specific experience 😊.

Thanks much.

~ Abhinav Bajaj

Arvid Heise-3

Re: Flink CPU load metrics in K8s

Hi Abhinav,

According to [1], you need 8u261 for the OperatingSystemMXBean to work as expected.

[1] https://bugs.openjdk.java.net/browse/JDK-8242287
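If the bean on an older JDK measures load against all host cores rather than the container limit, the two numbers in this thread can be reconciled by rescaling. The sketch below is purely illustrative: the host core count of 64 is a hypothetical value (the thread never states it), and the helper name is made up.

```java
final class LoadRescaleSketch {

    // Converts a load measured against all host cores into a fraction of
    // the container's CPU limit.
    static double relativeToLimit(double hostRelativeLoad, int hostCores, double containerCpus) {
        return hostRelativeLoad * hostCores / containerCpus;
    }

    public static void main(String[] args) {
        double reported = 0.002158428663932657; // Status.JVM.CPU.Load from the thread
        int hostCores = 64;                     // hypothetical: not stated in the thread
        double limit = 1.0;                     // 1 CPU per taskmanager, as in the test cluster

        System.out.println(relativeToLimit(reported, hostCores, limit));
    }
}
```

Under that assumed core count the rescaled value falls inside the 10-16% range seen in the K8s metrics, but this is only a plausibility check, not a confirmation of the host size.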

Arvid Heise | Senior Java Developer

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng

Bajaj, Abhinav

Re: Flink CPU load metrics in K8s

Awesome. This is exactly what I was going to look for.

Thanks much.

~ Abhinav
