Hi,
I got stuck using Prometheus and Pushgateway to collect metrics. Here is my reporter configuration:
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: localhost
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: myJob
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true
Version information:
Flink 1.9.1
Prometheus 2.18
Pushgateway 1.2 & 0.9 (I have tried both)
I found that when the Flink cluster restarts, new metrics show up with a new jobName carrying a random suffix, but the metrics with the jobName from before the restart are still there (their values stop updating). Since Prometheus keeps periodically scraping the Pushgateway, I end up with a pile of time series whose values never change. It looks like:
# HELP flink_jobmanager_Status_JVM_CPU_Load Load (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Load gauge
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0.0006602344673593189
# HELP flink_jobmanager_Status_JVM_CPU_Time Time (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Time gauge
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 4.54512e+09
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 8.24809e+09
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded ClassesLoaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 5984
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 6014
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded ClassesUnloaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0
PS: This cluster has one JobManager.
My understanding is that with metrics.reporter.promgateway.deleteOnShutdown set to true, the old metrics should be deleted from the Pushgateway. But somehow that is not happening.
Is my understanding of these configuration options correct? Is there any way to delete the stale metrics from the Pushgateway?
Thanks!
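As a manual stopgap, a stale group can also be deleted from the Pushgateway directly through its HTTP API. A minimal Python sketch, assuming the gateway at localhost:9091 as configured above and a group keyed only by the job label (add /<label>/<value> path segments if the reporter pushes extra grouping labels):

import requests

# One of the stale job names from the metrics listing above.
stale_job = "myJobae71620b106e8c2fdf86cb5c65fd6414"

# Pushgateway management API: DELETE /metrics/job/<job> removes every metric
# in that group; the path must match the group's full grouping key.
resp = requests.delete("http://localhost:9091/metrics/job/" + stale_job)
resp.raise_for_status()
print("Deleted group job=" + stale_job + " (HTTP " + str(resp.status_code) + ")")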
|
I met this issue three months ago. We eventually concluded that the Prometheus Pushgateway cannot handle high-throughput metric data. We solved it with service discovery instead: we changed the Prometheus metric reporter code to add registration logic, so each job exposes its host and port through a discovery service, and then wrote a plugin for Prometheus that fetches the service list and pulls the metrics directly from the Flink jobs.
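The registration change itself lives in Flink's Java reporter code, but as a rough sketch of the discovery side (the registry URL and target file path below are assumptions, not the actual plugin), a small script can turn the registered host:port list into a Prometheus file_sd target file so Prometheus pulls from the jobs directly:

import json
import urllib.request

# Hypothetical registry endpoint returning ["host1:9250", "host2:9251", ...]
# for every running JobManager/TaskManager metrics endpoint.
REGISTRY_URL = "http://registry.example.com/flink-metric-endpoints"

with urllib.request.urlopen(REGISTRY_URL) as response:
    targets = json.load(response)

# Prometheus file-based service discovery: point a scrape job's
# file_sd_configs at this file and Prometheus scrapes each target directly,
# so no Pushgateway (and no stale groups) is involved.
with open("/etc/prometheus/flink_targets.json", "w") as f:
    json.dump([{"targets": targets, "labels": {"job": "flink"}}], f, indent=2)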
From: 李佳宸 <[hidden email]>
Sent: Wednesday, May 13, 2020 11:26:26 AM
To: [hidden email]
Subject: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway
|
Hi
From our experience, rather than giving the Prometheus Pushgateway and servers more resources, you can leverage Flink's own features (available after Flink 1.10) to avoid sending unnecessary data, especially high-dimension tags such as task_attempt_id. In general, you can exclude "operator_id;task_id;task_attempt_id", which are rarely used, via metrics.reporter.<name>.scope.variables.excludes.
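For the reporter named promgateway in the configuration above, that would look something like the line below (verify that the option exists in your Flink version, since per the note above it is only available in newer releases):

metrics.reporter.promgateway.scope.variables.excludes: operator_id;task_id;task_attempt_id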
Best
Yun Tang
From: Thomas Huang <[hidden email]>
Sent: Wednesday, May 13, 2020 12:00
To: 李佳宸 <[hidden email]>; [hidden email]
Subject: Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway