Measure CPU utilization

Piper Piper
Hello,

What is the best way to measure the CPU utilization of a TaskManager in Flink, as opposed to using Linux's "top" command? Is querying the REST endpoint http://<IP>:<port>/taskmanagers/<TM_ID>/metrics?get=Status.JVM.CPU.Load the best option? Roman's reply from the archives (copied below) suggests that it returns the CPU usage of the whole system, including other processes running on it, and so would not give the CPU utilization of that TaskManager alone.

Roman describes JVM.CPU.Time as a clearer indicator of CPU usage. Can you suggest how I would use it to calculate CPU utilization? Also, is there any way to get the CPU utilization of a job that is distributed over several nodes in the cluster?

Also, what is the difference between the two REST API endpoints below:

1. http://<IP>:<port>/taskmanagers/<TM_ID>/metrics?get=Status.JVM.CPU.Load
2. http://<IP>:<port>/taskmanagers/<TM_ID>/metrics?get=System.CPU.Usage

Thanks,

Piper

Hi,

JVM.CPU.Load is just a wrapper (MetricUtils.instantiateCPUMetrics) on top of OperatingSystemMXBean.getProcessCpuLoad() (see https://docs.oracle.com/javase/7/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad()).

The value usually looks odd on machines with multiple CPU cores. For example, if a job with a single slot fully utilizes one core of an 8-core machine, JVM.CPU.Load will be 1.0/8.0 = 0.125. It is also a point-in-time snapshot of current CPU usage: if you collect metrics once a minute and the job has a spiky workload within that minute (say it is idle almost all the time and consumes 100% CPU for one second per minute), you can miss the spike completely.
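To make the 1.0/8.0 arithmetic concrete, here is a minimal sketch that converts a machine-normalized load value back into a number of busy cores. The function name cores_in_use is hypothetical, and the core count is assumed to be known from elsewhere (it is not part of the load metric itself).

```python
def cores_in_use(process_cpu_load: float, core_count: int) -> float:
    """Convert a load normalized over all cores (0.0 to 1.0) into busy cores."""
    # getProcessCpuLoad returns a negative value when the load is unavailable.
    if process_cpu_load < 0:
        raise ValueError("CPU load is not available")
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    return process_cpu_load * core_count


# The example from the text: one fully busy core on an 8-core machine.
print(cores_in_use(0.125, 8))  # 1.0
```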

Personally, I find JVM.CPU.Time a clearer indicator of CPU usage: it is an ever-increasing counter of the CPU time spent executing your code (the underlying OperatingSystemMXBean.getProcessCpuTime reports it in nanoseconds). Because it accumulates, it also captures CPU usage spikes.
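One way to turn such a cumulative counter into a utilization figure is to sample it twice and divide the CPU time consumed by the wall-clock time elapsed. The sketch below is only an illustration: the function name and parameters are hypothetical, and the counter unit is an assumption (nanoseconds, as reported by OperatingSystemMXBean.getProcessCpuTime), so adjust units_per_second if your deployment reports something else.

```python
def cpu_utilization(cpu_time_start: int,
                    cpu_time_end: int,
                    wall_seconds: float,
                    core_count: int = 1,
                    units_per_second: int = 1_000_000_000) -> float:
    """Average utilization between two samples of a cumulative CPU-time counter.

    With core_count=1 the result is in "cores" (1.0 means one core fully busy).
    With the machine's real core count it is a fraction of total capacity.
    """
    if wall_seconds <= 0 or core_count <= 0:
        raise ValueError("wall_seconds and core_count must be positive")
    delta = cpu_time_end - cpu_time_start
    if delta < 0:
        # A cumulative counter should never decrease; the process probably restarted.
        raise ValueError("counter decreased between samples")
    cpu_seconds = delta / units_per_second
    return cpu_seconds / (wall_seconds * core_count)


# 30 s of CPU time consumed during a 60 s window, expressed in cores.
print(cpu_utilization(0, 30_000_000_000, 60.0))  # 0.5
```

Because the counter accumulates, a one-second burst inside a one-minute window still shows up in the delta, which a point-in-time load sample could miss.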

Roman Grebennikov | [hidden email]
Re: Measure CPU utilization

rmetzger0
Hi Piper,

I personally like looking at the system load (if Flink is the only major process on the system). It nicely captures the "stress" Flink puts on the system; this would be the "System.CPU.Load5min" class of metrics. There are many articles on understanding Linux load averages.

I don't think there's anything built into Flink for getting the CPU utilization across the cluster.
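Since nothing is built in, a cluster-wide figure would have to be aggregated by hand. The following is a rough sketch under stated assumptions: TmSample and cluster_utilization are hypothetical names, each TaskManager's cumulative CPU-time counter is sampled at the start and end of the same wall-clock window, and the counter is assumed to be in nanoseconds. Note that this measures the TaskManager processes, so it approximates a single job only if those TaskManagers run nothing else.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TmSample:
    """Two readings of one TaskManager's cumulative CPU-time counter (hypothetical type)."""
    cpu_time_start: int
    cpu_time_end: int
    core_count: int


def cluster_utilization(samples: Iterable[TmSample],
                        wall_seconds: float,
                        units_per_second: int = 1_000_000_000) -> float:
    """CPU seconds consumed by all TaskManagers divided by total core-seconds available."""
    samples = list(samples)
    total_cores = sum(s.core_count for s in samples)
    if wall_seconds <= 0 or total_cores <= 0:
        raise ValueError("need a positive window and at least one core")
    cpu_seconds = sum(
        (s.cpu_time_end - s.cpu_time_start) / units_per_second for s in samples
    )
    return cpu_seconds / (wall_seconds * total_cores)


# Two 4-core TaskManagers over a 10 s window: one used 20 s of CPU, one was idle.
window = [TmSample(0, 20_000_000_000, 4), TmSample(0, 0, 4)]
print(cluster_utilization(window, 10.0))  # 0.25
```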

For the difference in the REST endpoints:
According to the Flink documentation, (1) captures the CPU usage of the JVM process (with the issue Roman described), while (2) captures the overall system CPU usage.
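For reference, either endpoint can be polled with a few lines of Python. This is a sketch, not a definitive client: the helper names are hypothetical, the base URL and TaskManager ID are placeholders you must supply, and the response is assumed to be a JSON array of objects with "id" and "value" fields, with several metric names passed as a comma-separated "get" list.

```python
import json
from typing import Dict, List
from urllib.request import urlopen


def metrics_url(base_url: str, tm_id: str, metric_names: List[str]) -> str:
    """Build the TaskManager metrics URL for a list of metric names."""
    return f"{base_url}/taskmanagers/{tm_id}/metrics?get={','.join(metric_names)}"


def parse_metrics(body: str) -> Dict[str, float]:
    """Parse the assumed response shape: [{"id": "...", "value": "..."}, ...]."""
    return {entry["id"]: float(entry["value"]) for entry in json.loads(body)}


def fetch_metrics(base_url: str, tm_id: str, metric_names: List[str],
                  timeout: float = 5.0) -> Dict[str, float]:
    """Query the REST endpoint once and return the metric values by name."""
    with urlopen(metrics_url(base_url, tm_id, metric_names), timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Offline example using a response in the assumed format.
sample = '[{"id": "Status.JVM.CPU.Load", "value": "0.125"}]'
print(parse_metrics(sample))  # {'Status.JVM.CPU.Load': 0.125}
```

Calling fetch_metrics("http://<IP>:<port>", "<TM_ID>", ["Status.JVM.CPU.Load", "System.CPU.Usage"]) against a running cluster would return both values in one request, which makes it easy to compare the process-level and system-level numbers side by side.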

Best,
Robert


On Thu, Sep 10, 2020 at 11:08 PM Piper Piper <[hidden email]> wrote: