JVM metrics disappearing after job crash, restart

JVM metrics disappearing after job crash, restart

Nikolas Davis
Howdy,

We are seeing our task manager JVM metrics disappear over time. The last time this happened, we correlated it to our job crashing and restarting. I wasn't able to grab the failing exception to share. Any thoughts?

We track metrics through the MetricReporter interface. As far as I can tell, this more or less only affects the JVM metrics; most, if not all, other metrics continue reporting fine after the job is automatically restarted.
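
For reference, here is a minimal sketch of how our reporter is wired up (the class body below is illustrative rather than our actual implementation; the backend call is a placeholder):

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NewRelicReporter implements MetricReporter, Scheduled {

    // Metrics keyed by their fully qualified identifier.
    private final Map<String, Metric> metrics = new ConcurrentHashMap<>();

    @Override
    public void open(MetricConfig config) {
        // Initialize the downstream client here.
    }

    @Override
    public void close() {
        // Shut down the downstream client here.
    }

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.put(group.getMetricIdentifier(metricName), metric);
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.remove(group.getMetricIdentifier(metricName));
    }

    @Override
    public void report() {
        // Invoked by Flink on the configured reporting interval.
        for (Map.Entry<String, Metric> entry : metrics.entrySet()) {
            // sendToBackend(entry.getKey(), entry.getValue());  // placeholder
        }
    }
}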

Nik Davis
Software Engineer
New Relic

Re: JVM metrics disappearing after job crash, restart

Ajay Tripathy
How are your metrics dimensionalized/named? Task managers often have UIDs generated for them, and the task ID dimension will change on restart. If you name your metrics based on this 'task_id', there will be a discontinuity with the old metric after a restart.
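
For example, Flink's default scope formats embed such generated IDs. The entries below are the documented defaults; <tm_id> is regenerated whenever a task manager restarts, so any identifier built from it changes too:

metrics.scope.tm: <host>.taskmanager.<tm_id>
metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>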

Re: JVM metrics disappearing after job crash, restart

Nikolas Davis
We keep track of metrics by using the value of MetricGroup::getMetricIdentifier, which returns the fully qualified metric name. The query that we use to monitor metrics filters for metric IDs that match '%Status.JVM.Memory%'. As long as the new metrics come online via the MetricReporter interface, I think the chart would be continuous; we would just see the old JVM memory metrics cycle into new ones.
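
For illustration, identifiers along these lines (hostname and tm_id invented here) would both match that filter, so after a restart we would expect one series to simply be replaced by another:

ip-10-1-2-3.taskmanager.5f8dc1.Status.JVM.Memory.Heap.Used   (before restart)
ip-10-1-2-3.taskmanager.9a02be.Status.JVM.Memory.Heap.Used   (after restart, new tm_id)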

Nik Davis
Software Engineer
New Relic

Re: JVM metrics disappearing after job crash, restart

Fabian Hueske
Hi Nik,

Can you have a look at this JIRA ticket [1] and check if it is related to the problems you are facing?
If so, would you mind leaving a comment there?

Thank you,
Fabian

Re: JVM metrics disappearing after job crash, restart

Chesnay Schepler
Can you show us the metrics-related configuration parameters in flink-conf.yaml?

Please also check the logs for any warnings from the MetricGroup and MetricRegistry classes.

Re: JVM metrics disappearing after job crash, restart

Nikolas Davis
Fabian,

It does look like it may be related. I'll add a comment. After digging a bit more I found that the crash and lack of metrics were precipitated by the JobManager instance crashing and cycling, which caused the job to restart.


Chesnay,

I didn't see anything interesting in our logs. Our reporter config is fairly straightforward (I think):

metrics.reporter.nr.class: com.newrelic.flink.NewRelicReporter
metrics.reporter.nr.interval: 60 SECONDS
metrics.reporters: nr

Nik Davis
Software Engineer
New Relic

Re: JVM metrics disappearing after job crash, restart

Chesnay Schepler
The config looks OK to me. On the Flink side, I cannot find an explanation for why only some metrics disappear.

The only explanation I can come up with at the moment is that FLINK-8946 is triggered: all metrics are (officially) unregistered, but the reporter isn't removing some of them (i.e. all job-related ones).
Due to FLINK-8946, no new metrics would be registered after the JobManager restart, but the old metrics would continue to be reported.

To verify this, I would add logging statements to the notifyOfAddedMetric/notifyOfRemovedMetric methods to check whether Flink attempts to unregister all metrics or only some.
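
A minimal sketch of that instrumentation, assuming an SLF4J logger inside the reporter class (logger name and message format are illustrative):

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricGroup;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOG = LoggerFactory.getLogger(NewRelicReporter.class);

@Override
public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
    LOG.info("Registered metric: {}", group.getMetricIdentifier(metricName));
    // ... existing registration logic ...
}

@Override
public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
    LOG.info("Unregistered metric: {}", group.getMetricIdentifier(metricName));
    // ... existing removal logic ...
}

After a restart, diffing the registered/unregistered lines should show whether the JVM metrics were ever unregistered and re-registered, or simply dropped.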
