Repeated exceptions during system metrics registration


Reinier Kip
Hi all,

I'm running a Beam pipeline on Flink and sending metrics via the Graphite reporter. I get repeated exceptions on the slaves, which try to register the same metric multiple times. JobManager and TaskManager metrics are fine: I can see the JVM metrics, but for tasks and operators only an occasional data point arrives.

I am using Beam 2.1.0, and am thus running Flink 1.3.0. What can be the cause of this or how can I find out?

Below you'll find the error, followed by Flink's metrics configuration.

Reinier

------

Log message: [ERROR] Error while registering metric.
Stack trace:

java.lang.IllegalArgumentException: A metric named bla.hdp-slave-015.taskmanager.3e32c83dedd7f4c7de3c41e2cb6d2a4a.bla-0929120623-613dc6af.operator.Map (Key Extractor).15.numRecordsOutPerSecond already exists
    at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)
    at org.apache.flink.dropwizard.ScheduledDropwizardReporter.notifyOfAddedMetric(ScheduledDropwizardReporter.java:151)
    at org.apache.flink.runtime.metrics.MetricRegistry.register(MetricRegistry.java:294)
    at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.addMetric(AbstractMetricGroup.java:370)
    at org.apache.flink.runtime.metrics.groups.AbstractMetricGroup.meter(AbstractMetricGroup.java:336)
    at org.apache.flink.runtime.metrics.groups.OperatorIOMetricGroup.<init>(OperatorIOMetricGroup.java:42)
    at org.apache.flink.runtime.metrics.groups.OperatorMetricGroup.<init>(OperatorMetricGroup.java:45)
    at org.apache.flink.runtime.metrics.groups.TaskMetricGroup.addOperator(TaskMetricGroup.java:133)
    at org.apache.flink.runtime.operators.chaining.ChainedDriver.setup(ChainedDriver.java:72)
    at org.apache.flink.runtime.operators.BatchTask.initOutputs(BatchTask.java:1299)
    at org.apache.flink.runtime.operators.BatchTask.initOutputs(BatchTask.java:1015)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:256)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)

------

metrics.reporters: graphite
metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.graphite.host: something
metrics.reporter.graphite.port: 2003
metrics.reporter.graphite.protocol: TCP
metrics.reporter.graphite.interval: 1 SECONDS
metrics.scope.jm: bla.<host>.jobmanager
metrics.scope.jm.job: bla.<host>.jobmanager.<job_name>
metrics.scope.tm: bla.<host>.taskmanager.<tm_id>
metrics.scope.tm.job: bla.<host>.taskmanager.<tm_id>.<job_name>
metrics.scope.task: bla.<host>.taskmanager.<tm_id>.<job_name>.task.<task_name>.<subtask_index>
metrics.scope.operator: bla.<host>.taskmanager.<tm_id>.<job_name>.operator.<operator_name>.<subtask_index>
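
For reference, here is a rough sketch of how the metrics.scope.operator format above appears to expand into the name in the error message. It uses plain string substitution, which is an assumption about the behaviour and not Flink's actual code. The class ScopeFormatSketch and the method expand are hypothetical names, and the variable values are read off the error message by position.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ScopeFormatSketch {

    /**
     * Replaces every <variable> placeholder in the scope format with its value
     * and appends the metric name. Hypothetical helper, not a Flink API.
     */
    static String expand(String scopeFormat, Map<String, String> variables, String metricName) {
        String scope = scopeFormat;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            scope = scope.replace("<" + entry.getKey() + ">", entry.getValue());
        }
        return scope + "." + metricName;
    }

    public static void main(String[] args) {
        // Values inferred from the error message in this post.
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("host", "hdp-slave-015");
        variables.put("tm_id", "3e32c83dedd7f4c7de3c41e2cb6d2a4a");
        variables.put("job_name", "bla-0929120623-613dc6af");
        variables.put("operator_name", "Map (Key Extractor)");
        variables.put("subtask_index", "15");

        String format =
            "bla.<host>.taskmanager.<tm_id>.<job_name>.operator.<operator_name>.<subtask_index>";
        System.out.println(expand(format, variables, "numRecordsOutPerSecond"));
    }
}
```

Nothing in the operator scope distinguishes two operators on the same subtask other than <operator_name>, so the name is the only part that can make the identifier unique.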

Re: Repeated exceptions during system metrics registration

Chesnay Schepler
You probably have multiple operators that are called "Map", so the metric identifier is not unique.
As a result, only one of these metrics is reported (whichever was registered first).

Giving each operator a unique name will resolve this issue, but I don't know exactly how to do that with Beam.
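
To illustrate the collision, here is a minimal sketch with a toy registry that rejects duplicate names with an IllegalArgumentException, as the registry in the stack trace does. ToyRegistry, identifier and registerAll are hypothetical names made up for this example, and the shortened scope prefix is a placeholder. None of this is Flink or Dropwizard code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateMetricSketch {

    /** Toy stand-in for a metric registry that refuses a name that is already taken. */
    static final class ToyRegistry {
        private final Map<String, Object> metrics = new HashMap<>();

        void register(String name, Object metric) {
            // putIfAbsent returns the existing value if the name is already registered.
            if (metrics.putIfAbsent(name, metric) != null) {
                throw new IllegalArgumentException("A metric named " + name + " already exists");
            }
        }

        int size() {
            return metrics.size();
        }
    }

    /** Builds an identifier from operator name, subtask index and metric name (placeholder prefix). */
    static String identifier(String operatorName, int subtaskIndex, String metricName) {
        return "bla.host.taskmanager.tm.job.operator."
            + operatorName + "." + subtaskIndex + "." + metricName;
    }

    /** Registers one metric per operator and returns the identifiers that were rejected. */
    static List<String> registerAll(ToyRegistry registry, List<String> operatorNames, int subtaskIndex) {
        List<String> rejected = new ArrayList<>();
        for (String operatorName : operatorNames) {
            String id = identifier(operatorName, subtaskIndex, "numRecordsOutPerSecond");
            try {
                registry.register(id, new Object());
            } catch (IllegalArgumentException e) {
                rejected.add(id); // the first registration wins, later ones are dropped
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Two operators with the same name on the same subtask: the second one is rejected.
        List<String> sameNames = Arrays.asList("Map (Key Extractor)", "Map (Key Extractor)");
        System.out.println(registerAll(new ToyRegistry(), sameNames, 15).size());

        // Unique names (hypothetical): nothing is rejected.
        List<String> uniqueNames = Arrays.asList("ExtractUserKey", "ExtractOrderKey");
        System.out.println(registerAll(new ToyRegistry(), uniqueNames, 15).size());
    }
}
```

With identical names only the first operator's metric ends up in the registry, which would match seeing only a single data series for operators that share a name.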

On 29.09.2017 16:03, Reinier Kip wrote:
I am using Beam 2.1.0, and am thus running Flink 1.3.0. What can be the cause of this or how can I find out?


Re: Repeated exceptions during system metrics registration

Reinier Kip

Why, of course... Thank you for your time. I'll figure out how to do that with Beam.


From: Chesnay Schepler <[hidden email]>
Sent: 29 September 2017 16:41:23
To: [hidden email]
Subject: Re: Repeated exceptions during system metrics registration
 
You probably have multiple operators that are called "Map", which causes the metric identifier to not be unique.