JVM metrics disappearing after job crash, restart

JVM metrics disappearing after job crash, restart

Nikolas Davis
Howdy,

We are seeing our task manager JVM metrics disappear over time. The last time this happened, we correlated it to our job crashing and restarting. I wasn't able to grab the failing exception to share. Any thoughts?

We track metrics through the MetricReporter interface. As far as I can tell, this more or less only affects the JVM metrics; most, if not all, other metrics continue reporting fine after the job is automatically restarted.
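
For reference, here is a minimal sketch of how our reporter is wired up (the class body below is illustrative rather than our actual implementation; the backend call is a placeholder):

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricConfig;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.metrics.reporter.MetricReporter;
import org.apache.flink.metrics.reporter.Scheduled;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NewRelicReporter implements MetricReporter, Scheduled {

    // Metrics keyed by their fully qualified identifier.
    private final Map<String, Metric> metrics = new ConcurrentHashMap<>();

    @Override
    public void open(MetricConfig config) {
        // Initialize the downstream client here.
    }

    @Override
    public void close() {
        // Shut down the downstream client here.
    }

    @Override
    public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.put(group.getMetricIdentifier(metricName), metric);
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
        metrics.remove(group.getMetricIdentifier(metricName));
    }

    @Override
    public void report() {
        // Invoked by Flink on the configured reporting interval.
        for (Map.Entry<String, Metric> entry : metrics.entrySet()) {
            // sendToBackend(entry.getKey(), entry.getValue());  // placeholder
        }
    }
}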

Nik Davis
Software Engineer
New Relic

Re: JVM metrics disappearing after job crash, restart

Ajay Tripathy
How are your metrics dimensionalized/named? Task managers often have UIDs generated for them, and the task ID dimension will change on restart. If you name your metrics based on this 'task_id', there will be a discontinuity with the old metric after a restart.
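
For example, Flink's default scope formats embed such generated IDs. The entries below are the documented defaults; <tm_id> is regenerated whenever a task manager restarts, so any identifier built from it changes too:

metrics.scope.tm: <host>.taskmanager.<tm_id>
metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>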

Re: JVM metrics disappearing after job crash, restart

Nikolas Davis
We keep track of metrics by using the value of MetricGroup::getMetricIdentifier, which returns the fully qualified metric name. The query that we use to monitor metrics filters for metric IDs that match '%Status.JVM.Memory%'. As long as the new metrics come online via the MetricReporter interface, I think the chart would be continuous; we would just see the old JVM memory metrics cycle into new ones.
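
For illustration, identifiers along these lines (hostname and tm_id invented here) would both match that filter, so after a restart we would expect one series to simply be replaced by another:

ip-10-1-2-3.taskmanager.5f8dc1.Status.JVM.Memory.Heap.Used   (before restart)
ip-10-1-2-3.taskmanager.9a02be.Status.JVM.Memory.Heap.Used   (after restart, new tm_id)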

Nik Davis
Software Engineer
New Relic

Re: JVM metrics disappearing after job crash, restart

Fabian Hueske
Hi Nik,

Can you have a look at this JIRA ticket [1] and check if it is related to the problems you are facing?
If so, would you mind leaving a comment there?

Thank you,
Fabian

Re: JVM metrics disappearing after job crash, restart

Chesnay Schepler
Can you show us the metrics-related configuration parameters in flink-conf.yaml?

Please also check the logs for any warnings from the MetricGroup and MetricRegistry classes.

Re: JVM metrics disappearing after job crash, restart

Nikolas Davis
Fabian,

It does look like it may be related. I'll add a comment. After digging a bit more I found that the crash and lack of metrics were precipitated by the JobManager instance crashing and cycling, which caused the job to restart.


Chesnay,

I didn't see anything interesting in our logs. Our reporter config is fairly straightforward (I think):

metrics.reporter.nr.class: com.newrelic.flink.NewRelicReporter
metrics.reporter.nr.interval: 60 SECONDS
metrics.reporters: nr

Nik Davis
Software Engineer
New Relic

Re: JVM metrics disappearing after job crash, restart

Chesnay Schepler
The config looks OK to me. On the Flink side, I cannot find an explanation for why only some metrics disappear.

The only explanation I can come up with at the moment is that FLINK-8946 is triggered: all metrics are (officially) unregistered, but the reporter isn't removing some of them (i.e. all job-related ones).
Due to FLINK-8946, no new metrics would be registered after the JobManager restart, but the old metrics would continue to be reported.

To verify this, I would add logging statements to the notifyOfAddedMetric/notifyOfRemovedMetric methods to check whether Flink attempts to unregister all metrics or only some.
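
A minimal sketch of that instrumentation, assuming an SLF4J logger inside the reporter class (logger name and message format are illustrative):

import org.apache.flink.metrics.Metric;
import org.apache.flink.metrics.MetricGroup;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

private static final Logger LOG = LoggerFactory.getLogger(NewRelicReporter.class);

@Override
public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
    LOG.info("Registered metric: {}", group.getMetricIdentifier(metricName));
    // ... existing registration logic ...
}

@Override
public void notifyOfRemovedMetric(Metric metric, String metricName, MetricGroup group) {
    LOG.info("Unregistered metric: {}", group.getMetricIdentifier(metricName));
    // ... existing removal logic ...
}

After a restart, diffing the registered/unregistered lines should show whether the JVM metrics were ever unregistered and re-registered, or simply dropped.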
