Howdy,

We are seeing our task manager JVM metrics disappear over time. The last time it happened we correlated it to our job crashing and restarting. I wasn't able to grab the failing exception to share. Any thoughts? We track metrics through the MetricReporter interface. As far as I can tell this only affects the JVM metrics; most, if not all, other metrics continue reporting fine once the job is automatically restarted.

Nik Davis
Software Engineer
New Relic
How are your metrics dimensionalized/named? Task managers often have UIDs generated for them, and the task id dimension will change on restart. If you name your metrics based on this 'task_id', there will be a discontinuity with the old metric.

On Wed, May 30, 2018 at 4:49 PM, Nikolas Davis <[hidden email]> wrote:
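A minimal sketch of how to check this (assuming the Flink 1.x MetricReporter API; the class name is hypothetical): dump each metric's scope variables at registration time, then compare the task/task manager ids before and after a restart.

    import java.util.Map;

    import org.apache.flink.metrics.Metric;
    import org.apache.flink.metrics.MetricConfig;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.metrics.reporter.MetricReporter;

    public class DimensionDumpingReporter implements MetricReporter {

        @Override public void open(MetricConfig config) {}
        @Override public void close() {}

        @Override
        public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
            // getAllVariables() exposes dimensions such as <host>, <tm_id>, <job_id>,
            // <task_id>; ids that change across restarts show up here.
            for (Map.Entry<String, String> var : group.getAllVariables().entrySet()) {
                System.out.println(metricName + ": " + var.getKey() + "=" + var.getValue());
            }
        }

        @Override
        public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {}
    }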
We keep track of metrics by using the value of MetricGroup::getMetricIdentifier, which returns the fully qualified metric name. The query we use to monitor metrics filters for metric IDs matching '%Status.JVM.Memory%'. As long as the new metrics come online via the MetricReporter interface, the chart should be continuous; we would just see the old JVM memory metrics cycle over into new ones.
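For illustration, a minimal sketch of that keying scheme (hypothetical code, assuming the Flink 1.x MetricReporter API; the contains() check mirrors the '%Status.JVM.Memory%' LIKE pattern):

    import org.apache.flink.metrics.Metric;
    import org.apache.flink.metrics.MetricConfig;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.metrics.reporter.MetricReporter;

    public class IdentifierKeyedReporter implements MetricReporter {

        @Override public void open(MetricConfig config) {}
        @Override public void close() {}

        @Override
        public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
            // Fully qualified name, e.g. "host-1.taskmanager.<tm_id>.Status.JVM.Memory.Heap.Used"
            String id = group.getMetricIdentifier(metricName);
            if (id.contains("Status.JVM.Memory")) {
                // one of the JVM memory metrics the monitoring query would match
            }
        }

        @Override
        public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {}
    }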
Nik Davis
Software Engineer
New Relic

On Wed, May 30, 2018 at 5:30 PM, Ajay Tripathy <[hidden email]> wrote:
Hi Nik,

Can you have a look at this JIRA ticket [1] and check if it is related to the problems you are facing?

2018-05-31 4:41 GMT+02:00 Nikolas Davis <[hidden email]>:
Can you show us the metrics-related configuration parameters in flink-conf.yaml? Please also check the logs for any warnings from the MetricGroup and MetricRegistry classes.

On 04.06.2018 10:44, Fabian Hueske wrote:
Fabian,
It does look like it may be related; I'll add a comment. After digging a bit more, I found that the crash and loss of metrics were precipitated by the JobManager instance crashing and cycling, which caused the job to restart.

Chesnay, I didn't see anything interesting in our logs. Our reporter config is fairly straightforward (I think):

    metrics.reporter.nr.class: com.newrelic.flink.
    metrics.reporter.nr.interval: 60 SECONDS
    metrics.reporters: nr

Nik Davis
Software Engineer
New Relic

On Mon, Jun 4, 2018 at 1:56 AM, Chesnay Schepler <[hidden email]> wrote:
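One note on the interval setting, as a hedged sketch (assuming the New Relic reporter, like Flink's built-in push reporters, implements the Scheduled interface; the class name and body below are hypothetical): Flink only honors metrics.reporter.nr.interval for reporters that implement Scheduled, and with this config it would call report() every 60 seconds.

    import org.apache.flink.metrics.Metric;
    import org.apache.flink.metrics.MetricConfig;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.metrics.reporter.MetricReporter;
    import org.apache.flink.metrics.reporter.Scheduled;

    public class NewRelicReporter implements MetricReporter, Scheduled {

        @Override public void open(MetricConfig config) { /* read connection settings */ }
        @Override public void close() {}
        @Override public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) { /* start tracking */ }
        @Override public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) { /* stop tracking */ }

        @Override
        public void report() {
            // called by Flink on the configured 60-second interval;
            // push the currently tracked metric values here
        }
    }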
The config looks OK to me. On the Flink side I cannot find an explanation for why only some metrics disappear.

The only explanation I can come up with at the moment is that FLINK-8946 is triggered, all metrics are (officially) unregistered, but the reporter isn't removing some of them (i.e. all the job-related ones). Due to FLINK-8946, no new metrics would be registered after the JM restart, but the old metrics would continue to be reported. To verify this, I would add logging statements to the notifyOfAddedMetric/notifyOfRemovedMetric methods (see the sketch below) to check whether Flink attempts to unregister all metrics or only some.

On 05.06.2018 02:02, Nikolas Davis wrote:
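A minimal sketch of such a logging reporter (assuming the Flink 1.x MetricReporter API; the class name and log format are illustrative, and in practice you would add these lines to the existing New Relic reporter):

    import org.apache.flink.metrics.Metric;
    import org.apache.flink.metrics.MetricConfig;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.flink.metrics.reporter.MetricReporter;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class RegistrationLoggingReporter implements MetricReporter {

        private static final Logger LOG = LoggerFactory.getLogger(RegistrationLoggingReporter.class);

        @Override public void open(MetricConfig config) {}
        @Override public void close() {}

        @Override
        public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
            // Log every registration so it can be matched against a removal
            // after the JobManager restarts.
            LOG.info("ADDED   {}", group.getMetricIdentifier(metricName));
        }

        @Override
        public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
            LOG.info("REMOVED {}", group.getMetricIdentifier(metricName));
        }
    }

If every ADDED line pairs with a REMOVED line except for metrics that are never removed, that would point at the FLINK-8946 scenario described above.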