Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

李佳宸
Hi,

I am stuck using Prometheus with the Pushgateway to collect metrics. Here is my reporter configuration:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: localhost
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: myJob
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true

And the version information:
Flink 1.9.1
Prometheus 2.18
PushGateway 1.2 and 0.9 (I have tried both)

I found that when the Flink cluster restarts, metrics show up under a new jobName with a new random suffix, but the metrics under the jobName from before the restart still exist (their values have stopped updating). Since Prometheus keeps periodically scraping the Pushgateway, I end up with a bunch of time series whose values never change.

It looks like:

# HELP flink_jobmanager_Status_JVM_CPU_Load Load (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Load gauge
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0.0006602344673593189
# HELP flink_jobmanager_Status_JVM_CPU_Time Time (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Time gauge
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 4.54512e+09
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 8.24809e+09
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded ClassesLoaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 5984
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 6014
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded ClassesUnloaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0
P.S. This cluster has one JobManager.
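
To see which job groups are present in output like the above, the text can be grouped by its job label. The following is only a rough sketch: the function name series_per_job and the regular expressions are illustrative, and the parsing is simplified (it assumes label values contain no "}" or escaped quotes).

```python
import re
from collections import defaultdict

# One sample line: metric name, label set in braces, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# The job label, anchored so that a label such as other_job is not matched.
JOB_RE = re.compile(r'(?:^|,)job="([^"]*)"')


def series_per_job(exposition_text):
    """Count sample lines per job label value; '#' comment lines are skipped."""
    counts = defaultdict(int)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sample = SAMPLE_RE.match(line)
        if sample is None:
            continue  # not a labelled sample line; ignored in this sketch
        job = JOB_RE.search(sample.group("labels"))
        if job:
            counts[job.group(1)] += 1
    return dict(counts)


# Two lines copied from the output above.
SAMPLE = """\
# TYPE flink_jobmanager_Status_JVM_CPU_Load gauge
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0.0006602344673593189
"""

if __name__ == "__main__":
    for job, count in series_per_job(SAMPLE).items():
        print(job, count)
```

Any job name that keeps appearing here after its cluster is gone is a candidate stale group.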

As I understand it, setting metrics.reporter.promgateway.deleteOnShutdown to true should cause the old metrics to be deleted from the Pushgateway on shutdown, but that did not happen.
Is my understanding of this configuration correct? Is there any way to delete these metrics from the Pushgateway?

Thanks!
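
For reference, a stale group can also be removed by hand, because the Pushgateway accepts an HTTP DELETE on the same /metrics/job/<jobname> path that is used for pushing. The sketch below is a minimal illustration, not what the Flink reporter does internally. The helper names group_url and delete_group are made up for this example, and it assumes the group was pushed with the job label as its only grouping key and that the job name contains no "/" characters.

```python
import urllib.parse
import urllib.request


def group_url(host, port, job):
    """Build the Pushgateway URL of the group identified only by its job label."""
    # Assumption: no extra grouping labels, and no "/" in the job name.
    return "http://%s:%d/metrics/job/%s" % (
        host, port, urllib.parse.quote(job, safe=""))


def delete_group(host, port, job, timeout=5.0):
    """Send an HTTP DELETE for one group and return the HTTP status code."""
    request = urllib.request.Request(group_url(host, port, job), method="DELETE")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Prints the URL only; call delete_group(...) against a running Pushgateway.
    print(group_url("localhost", 9091, "myJobae71620b106e8c2fdf86cb5c65fd6414"))
```

If a group was pushed with additional grouping labels, those labels have to appear in the path as well, otherwise the DELETE addresses a different group.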

Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

Thomas Huang
I ran into this issue three months ago. We eventually concluded that the Prometheus Pushgateway cannot handle high-throughput metric data, and we solved it with service discovery instead. We changed the Prometheus metric reporter code to add registration logic, so that each job exposes its host and port to a discovery service, and then wrote a plugin for Prometheus that fetches the service list and pulls the metrics from the Flink jobs.
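
The custom reporter and plugin mentioned above are not shown in this thread. As a rough illustration of the same pull-based idea, the sketch below swaps in Prometheus's built-in file-based service discovery: it renders a list of "host:port" endpoints into the JSON target format that a file_sd_configs entry reads. The function name file_sd_document, the host names, the port and the label are all hypothetical.

```python
import json


def file_sd_document(targets, labels=None):
    """Return file_sd JSON text describing one target group.

    targets: iterable of "host:port" strings that Prometheus should scrape.
    labels:  optional dict of labels attached to every target in the group.
    """
    group = {"targets": sorted(targets)}
    if labels:
        group["labels"] = dict(labels)
    return json.dumps([group], indent=2)


if __name__ == "__main__":
    # Hypothetical endpoints where Flink processes expose pull-based metrics.
    print(file_sd_document(["tm-1:9249", "jm-1:9249"], {"cluster": "demo"}))
```

Whatever keeps this file up to date (a discovery service, a scheduled script) then plays the role of the registration logic: a job that disappears is simply dropped from the target list, so nothing stale is left behind for Prometheus to scrape.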


From: 李佳宸 <[hidden email]>
Sent: Wednesday, May 13, 2020 11:26:26 AM
To: [hidden email] <[hidden email]>
Subject: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway
 

Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

Yun Tang
Hi

In our experience, rather than giving the Prometheus Pushgateway and servers more resources, you can use a feature available since Flink 1.10 to avoid sending unnecessary data (especially high-cardinality tags, e.g. task_attempt_id). In general, you can exclude "operator_id;task_id;task_attempt_id", which are rarely used, via metrics.reporter.<name>.scope.variables.excludes.
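
Applied to the reporter named promgateway from the original configuration, that setting might look like the fragment below. This is only a sketch: it assumes Flink 1.10 or later and reuses the reporter name from the question.

```yaml
# Assumes Flink 1.10+; "promgateway" is the reporter name from the original post.
metrics.reporter.promgateway.scope.variables.excludes: operator_id;task_id;task_attempt_id
```

Excluded variables are no longer sent as tags, so fewer distinct series are pushed per job.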


Best
Yun Tang

From: Thomas Huang <[hidden email]>
Sent: Wednesday, May 13, 2020 12:00
To: 李佳宸 <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway
 