Operator metrics do not get unregistered after job finishes

Helmut Zechmann-2
Hi all,


We are using Flink 1.5.2 in batch mode with Prometheus monitoring.

We noticed that a few metrics do not get unregistered after a job is finished:

flink_taskmanager_job_task_operator_numRecordsIn
flink_taskmanager_job_task_operator_numRecordsInPerSecond
flink_taskmanager_job_task_operator_numRecordsOut
flink_taskmanager_job_task_operator_numRecordsOutPerSecond


Those metrics stay in the TaskManager's metrics list until the TaskManager is restarted.

Our metrics config is:

metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 7000-7001

metrics.scope.jm: flink.<host>.jobmanager
metrics.scope.tm: flink.<host>.taskmanager.<tm_id>
metrics.scope.jm.job: flink.<host>.jobmanager.<job_name>
metrics.scope.tm.job: flink.<host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: flink.<host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
metrics.scope.operator: flink.<host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
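
For anyone who wants to check a TaskManager for the same behaviour, here is a minimal sketch that counts how many series of the four operator metrics above are still exposed per job. It works on the Prometheus text exposition format as plain text. The label name `job_name`, the helper `count_series_by_job`, and the sample data are assumptions made for illustration, not output from a real cluster.

```python
import re
from collections import Counter

# The four operator metrics named in the post.
OPERATOR_METRICS = (
    "flink_taskmanager_job_task_operator_numRecordsIn",
    "flink_taskmanager_job_task_operator_numRecordsInPerSecond",
    "flink_taskmanager_job_task_operator_numRecordsOut",
    "flink_taskmanager_job_task_operator_numRecordsOutPerSecond",
)

# Matches one label pair such as job_name="batch_a".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def count_series_by_job(exposition_text, metric_names=OPERATOR_METRICS,
                        job_label="job_name"):
    """Count exposed series per job for the given metric names.

    `job_label` is an assumed label name; adjust it to whatever the
    reporter actually emits.
    """
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, brace, rest = line.partition("{")
        if brace:
            # Everything up to the last "}" is the label set.
            labels = dict(LABEL_RE.findall(rest.rsplit("}", 1)[0]))
        else:
            name = line.split()[0]  # series without labels
            labels = {}
        if name in metric_names:
            counts[labels.get(job_label, "<no job label>")] += 1
    return counts


# Made-up scrape output, for illustration only.
sample = """\
# TYPE flink_taskmanager_job_task_operator_numRecordsIn gauge
flink_taskmanager_job_task_operator_numRecordsIn{job_name="batch_a",operator_name="map",subtask_index="0"} 100.0
flink_taskmanager_job_task_operator_numRecordsOut{job_name="batch_a",operator_name="map",subtask_index="0"} 100.0
flink_taskmanager_job_task_operator_numRecordsIn{job_name="batch_b",operator_name="map",subtask_index="0"} 7.0
flink_taskmanager_Status_JVM_CPU_Load 0.1
"""

counts = count_series_by_job(sample)
for job, n in sorted(counts.items()):
    print(job, n)
```

To run it against a live TaskManager, the text could be fetched from the reporter port configured above (for example with `urllib.request.urlopen`) and passed to the function; the exact URL depends on the setup. Any job that has already finished but still shows a non-zero count would match the behaviour described here.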


Since we run many batch jobs, this makes Prometheus monitoring unusable for us. Is this a known issue?


Best,

Helmut

Re: Operator metrics do not get unregistered after job finishes

vino yang
Hi Helmut,

Are the metrics of all subtask instances of a job left registered, or only some of them? Is there any exception information in the logs?

Please feel free to create a JIRA issue and clearly describe your problem.

Thanks, vino.

In reply to Helmut Zechmann <[hidden email]>, who wrote on Fri, Aug 17, 2018 at 11:14 PM.

Re: Operator metrics do not get unregistered after job finishes

Helmut Zechmann-2
Hi Vino,

The log shows no problems. The problem can be reproduced easily. I created https://issues.apache.org/jira/browse/FLINK-10300.

Best,

Helmut

In reply to vino yang <[hidden email]>, who wrote on 18 Aug 2018 at 04:53.
