(DEPRECATED) Apache Flink User Mailing List archive.

Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway


李佳宸

Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

Hi,

I am stuck using Prometheus with the Pushgateway to collect metrics. Here is my reporter configuration:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter

metrics.reporter.promgateway.host: localhost

metrics.reporter.promgateway.port: 9091

metrics.reporter.promgateway.jobName: myJob

metrics.reporter.promgateway.randomJobNameSuffix: true

metrics.reporter.promgateway.deleteOnShutdown: true

And the version information:

Flink 1.9.1

Prometheus 2.18

Pushgateway 1.2 and 0.9 (I have tried both)

I found that when the Flink cluster restarts, metrics show up under a new jobName with a new random suffix. However, the metrics under the jobName from before the restart still exist in the Pushgateway (their values just stop updating). Since Prometheus keeps scraping the Pushgateway periodically, I end up with a bunch of time series whose values never change.

It looks like:

# HELP flink_jobmanager_Status_JVM_CPU_Load Load (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Load gauge
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_CPU_Load{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0.0006602344673593189
# HELP flink_jobmanager_Status_JVM_CPU_Time Time (scope: jobmanager_Status_JVM_CPU)
# TYPE flink_jobmanager_Status_JVM_CPU_Time gauge
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 4.54512e+09
flink_jobmanager_Status_JVM_CPU_Time{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 8.24809e+09
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded ClassesLoaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 5984
flink_jobmanager_Status_JVM_ClassLoader_ClassesLoaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 6014
# HELP flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded ClassesUnloaded (scope: jobmanager_Status_JVM_ClassLoader)
# TYPE flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded gauge
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobae71620b106e8c2fdf86cb5c65fd6414"} 0
flink_jobmanager_Status_JVM_ClassLoader_ClassesUnloaded{host="localhost",instance="",job="myJobe50caa3be194aeb2ff71a64bced17cea"} 0

P.S. This cluster has one JobManager.

As I understand it, setting metrics.reporter.promgateway.deleteOnShutdown to true should delete the old metrics from the Pushgateway when the cluster shuts down. But somehow it didn't work.

Is my understanding of this configuration correct? Is there any way to delete the metrics from the Pushgateway?

Thanks!
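
As a possible manual workaround for the question above, here is a minimal Python sketch. It assumes the Pushgateway exposes a push_time_seconds sample per group on its /metrics endpoint and accepts an HTTP DELETE on /metrics/job/<job name> for a group whose only grouping label is job (as in the output above, where instance is empty). The helper names stale_jobs, delete_url and delete_group are made up for illustration, and the sketch assumes job names contain no "/" characters. It is not what deleteOnShutdown does internally.

```python
import re
import urllib.parse
import urllib.request

# Matches lines such as: push_time_seconds{instance="",job="myJobabc"} 1.589e+09
PUSH_TIME = re.compile(r'^push_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<ts>\S+)$')
JOB_LABEL = re.compile(r'(?:^|,)job="(?P<job>[^"]*)"')


def stale_jobs(metrics_text, now, max_age_seconds):
    """Return job names whose last push is older than max_age_seconds."""
    stale = []
    for line in metrics_text.splitlines():
        match = PUSH_TIME.match(line.strip())
        if not match:
            continue  # comment line or some other metric
        job = JOB_LABEL.search(match.group("labels"))
        if job and now - float(match.group("ts")) > max_age_seconds:
            stale.append(job.group("job"))
    return stale


def delete_url(base_url, job):
    """Build the URL of the group identified by the job label alone."""
    return f"{base_url.rstrip('/')}/metrics/job/{urllib.parse.quote(job, safe='')}"


def delete_group(base_url, job):
    """Send the DELETE request and return the HTTP status code."""
    request = urllib.request.Request(delete_url(base_url, job), method="DELETE")
    with urllib.request.urlopen(request) as response:
        return response.status
```

A cron job could fetch http://localhost:9091/metrics, pass the text to stale_jobs with the current Unix time, and call delete_group for each name returned. The age threshold should be well above the reporter's push interval so that live jobs are never removed.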

Thomas Huang

Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

I ran into this issue three months ago. We eventually concluded that the Prometheus Pushgateway cannot handle high-throughput metric data, and we solved the problem with service discovery instead. We changed the Prometheus metric reporter code to add registration logic, so that each job exposes its host and port to a discovery service. We then wrote a plugin for Prometheus that fetches the service list and pulls the metrics from the Flink jobs.
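
The reporter changes and the Prometheus plugin described above are not shown in this thread. As a rough illustration of the same pull-based idea, the Python sketch below swaps in a different, simpler technique: Prometheus file-based service discovery, where a JSON file lists target groups with "targets" and "labels" keys. The function names, the example host, and the label values are hypothetical; port 9249 is assumed here because it is the default of Flink's pull-based PrometheusReporter.

```python
import json
import os
import tempfile


def build_target_groups(endpoints):
    """Turn (host, port, labels) tuples into file_sd target groups."""
    groups = []
    for host, port, labels in endpoints:
        groups.append({"targets": [f"{host}:{port}"], "labels": dict(labels)})
    return groups


def write_targets_file(path, endpoints):
    """Write the targets file via a temp file and rename,
    so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(build_target_groups(endpoints), handle, indent=2)
    os.replace(tmp_path, path)
```

For example, write_targets_file("flink_targets.json", [("10.0.0.5", 9249, {"component": "jobmanager"})]) produces a file that a file_sd_configs entry in the Prometheus scrape configuration could watch. When a job goes away its entry is dropped from the file, so no stale series keep being scraped.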

From: 李佳宸 <[hidden email]>
Sent: Wednesday, May 13, 2020 11:26:26 AM
To: [hidden email] <[hidden email]>
Subject: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway


Yun Tang

Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway

In our experience, rather than giving the Prometheus Pushgateway and servers more resources, we can use a feature available since Flink 1.10 to avoid sending unnecessary data, especially high-cardinality tags such as task_attempt_id. In general, we can exclude "operator_id;task_id;task_attempt_id", which are rarely used, via metrics.reporter.<name>.scope.variables.excludes [1].

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/metrics.html#reporter
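
To illustrate why excluding these variables reduces the amount of data, here is a small Python sketch (not Flink code). It counts distinct label sets before and after dropping the excluded names; the sample label values are made up. For the reporter in this thread, the corresponding flink-conf.yaml entry would presumably be the one shown in the first comment.

```python
# Presumed flink-conf.yaml entry for the reporter named "promgateway":
#   metrics.reporter.promgateway.scope.variables.excludes: operator_id;task_id;task_attempt_id
EXCLUDES = "operator_id;task_id;task_attempt_id"


def distinct_series(label_sets, excludes=""):
    """Return the set of distinct label sets after dropping excluded labels."""
    dropped = {name for name in excludes.split(";") if name}
    return {
        frozenset((key, value) for key, value in labels.items() if key not in dropped)
        for labels in label_sets
    }


# Three attempts of the same subtask (hypothetical values).
attempts = [
    {"task_name": "map", "subtask_index": "0", "task_attempt_id": "a1"},
    {"task_name": "map", "subtask_index": "0", "task_attempt_id": "b2"},
    {"task_name": "map", "subtask_index": "0", "task_attempt_id": "c3"},
]
print(len(distinct_series(attempts)))            # one series per attempt
print(len(distinct_series(attempts, EXCLUDES)))  # attempts collapse into one series
```

Each task restart gets a new task_attempt_id, so without the exclusion every restart adds a new series per metric, while with it the restarted task keeps writing to the same series.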

Best

Yun Tang

From: Thomas Huang <[hidden email]>
Sent: Wednesday, May 13, 2020 12:00
To: 李佳宸 <[hidden email]>; [hidden email] <[hidden email]>
Subject: Re: Prometheus Pushgateway Reporter Can not DELETE metrics on pushgateway
