Missing metrics when using metric reporter on high parallelism

Nikola Hrusov
Hello,

I am running some tests with Flink 1.11.1 and have noticed something strange with the exported metrics.

My reporter configuration is:

metrics.reporter.graphite.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
metrics.reporter.graphite.host: graphite
metrics.reporter.graphite.port: 8080
metrics.reporter.graphite.protocol: tcp
metrics.reporter.graphite.interval: 10 SECONDS


This should send metrics to Graphite every 10 seconds.

This works with low parallelism (e.g. <= 20): we get all metrics, every 10 seconds.
However, when I scale my job to a parallelism of 200 or more, the metrics no longer arrive every 10 seconds. Sometimes they are missing for up to 3 reporting cycles.
I had a brief look at the code here: https://github.com/apache/flink/blob/release-1.11.1/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/MetricRegistryImpl.java#L107-L144 and it looks like the reporting runs on a separate thread. My first guess was that this single thread has too much work to do.

I have tried lowering the reporting interval from 10 SECONDS to 6-7 SECONDS, but metrics are still missing in that case. This also happens for simple jobs such as "source -> map -> sink" when they run with higher parallelism.

What can I do to debug this further or make it work? Has anyone come across this before?

Regards,
Nikola Hrusov

Re: Missing metrics when using metric reporter on high parallelism

Chesnay Schepler
IIRC this can be caused by the Carbon MAX_CREATES_PER_MINUTE setting.
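
To illustrate why such a limit would only bite at high parallelism: Carbon's MAX_CREATES_PER_MINUTE (in the [cache] section of carbon.conf) caps how many new Whisper files are created per minute, and datapoints for series whose file does not exist yet can be dropped in the meantime. The number of series grows linearly with parallelism. Below is a rough sketch of that arithmetic. The function name and the 50 series per subtask are made-up assumptions for illustration, and 50 creates per minute is the value I believe ships in the example carbon.conf, so check your own installation.

```python
import math

def creation_backlog_minutes(new_series, max_creates_per_minute):
    """Estimate how many minutes Carbon needs before every new series
    has a Whisper file, given its create-rate limit."""
    return math.ceil(new_series / max_creates_per_minute)

# Hypothetical numbers: assume each parallel subtask exposes 50 series.
SERIES_PER_SUBTASK = 50      # assumption, not measured
MAX_CREATES_PER_MINUTE = 50  # assumed Carbon setting

for parallelism in (20, 200):
    series = parallelism * SERIES_PER_SUBTASK
    minutes = creation_backlog_minutes(series, MAX_CREATES_PER_MINUTE)
    print(f"parallelism={parallelism}: {series} new series, "
          f"about {minutes} min until all files exist")
```

If this is the cause, raising MAX_CREATES_PER_MINUTE in carbon.conf should shorten the window in which new series are incomplete, at the cost of more disk I/O on the Carbon host.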

I would deem it unlikely that the reporter thread is busy for 30 seconds.
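
One way to tell the two cases apart is to capture what the reporter actually puts on the wire (for example by temporarily pointing metrics.reporter.graphite.host at a plain TCP listener that writes everything to a file) and check the capture for gaps. Below is a minimal sketch, assuming the Graphite plaintext format of one "path value timestamp" line per datapoint with timestamps in whole seconds. The function name find_gaps and the sample metric paths are hypothetical.

```python
from collections import defaultdict

def find_gaps(lines, interval_s, tolerance_s=1):
    """Return {metric_path: [(prev_ts, next_ts), ...]} for consecutive
    datapoints that are more than interval_s + tolerance_s apart."""
    timestamps = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        path, _value, ts = parts
        timestamps[path].append(int(ts))

    gaps = {}
    for path, ts_list in timestamps.items():
        ts_list.sort()
        missing = [(a, b) for a, b in zip(ts_list, ts_list[1:])
                   if b - a > interval_s + tolerance_s]
        if missing:
            gaps[path] = missing
    return gaps

# Made-up capture: subtask 0 skips two 10-second cycles, subtask 1 does not.
sample = [
    "host.taskmanager.job.map.0.numRecordsIn 10 1597157820",
    "host.taskmanager.job.map.0.numRecordsIn 20 1597157830",
    "host.taskmanager.job.map.0.numRecordsIn 50 1597157860",
    "host.taskmanager.job.map.1.numRecordsIn 10 1597157820",
    "host.taskmanager.job.map.1.numRecordsIn 20 1597157830",
]
print(find_gaps(sample, interval_s=10))
```

If the capture shows no gaps at parallelism 200, the reporter is sending on schedule and the datapoints are being lost on the Graphite side.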

On 11/08/2020 16:57, Nikola Hrusov wrote:
> [...]