Flink and Prometheus monitoring question


Jesús Vásquez
Hi,
I want to monitor Flink streaming jobs using Prometheus.
My first goal is to send alerts when a Flink job has failed.
The problem is that, looking at the documentation, I haven't found a metric that helps me define an alerting rule.
As a starting point, I thought the metric flink_jobmanager_job_downtime could help, since the docs say it emits -1 for a completed job.
But when I tested it, I found that this doesn't work: the metric always emits 0, and once the job has completed the metric is no longer emitted at all.
Has anyone managed to alert on a failed Flink job with Prometheus?
Thanks for your help.

Re: [EXTERNAL] Flink and Prometheus monitoring question

PoolakkalMukkath, Shakir

You could use "flink_jobmanager_numRunningJobs" to check the number of running jobs.

Thanks
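
For illustration, the suggestion above could be turned into a Prometheus alerting rule along these lines. This is only a sketch: the group name, alert name, the threshold of 3 expected jobs, the 2m hold time and the severity label are all placeholders that are not from this thread, and the expected job count has to be set by hand.

```yaml
groups:
  - name: flink-job-count            # hypothetical group name
    rules:
      - alert: FlinkRunningJobsBelowExpected   # hypothetical alert name
        # 3 is a placeholder: replace it with the number of jobs you expect to be running
        expr: flink_jobmanager_numRunningJobs < 3
        # wait 2 minutes before firing so that short restarts do not alert
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fewer Flink jobs are running than expected"
```

A rule like this only tells you that some job is missing, not which one.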

 


Re: [EXTERNAL] Flink and Prometheus monitoring question

Jesús Vásquez
The problem with the numRunningJobs metric is that I have to configure the Prometheus rules in advance with the number of jobs I expect to be running in order to alert. I need a rule that alerts on individual jobs. I initially thought of flink_jobmanager_job_downtime{job_id=~".*"} == -1, but it turned out that the metric just emits 0 for running jobs and doesn't emit -1 for failed jobs.
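
Since the metric is reported as disappearing once the job is gone, one possible workaround (not proposed by anyone in this thread) is the general PromQL pattern of comparing against the past with offset and unless, which returns series that existed some time ago but are missing now. A rough sketch follows; the group name, alert name, the 10m window and the job_name label are assumptions.

```yaml
groups:
  - name: flink-job-disappeared      # hypothetical group name
    rules:
      - alert: FlinkJobMetricDisappeared   # hypothetical alert name
        # series that were reported 10 minutes ago but are not reported now;
        # matching is done on the full label set, so this works per job
        expr: |
          flink_jobmanager_job_downtime offset 10m
            unless flink_jobmanager_job_downtime
        labels:
          severity: critical
        annotations:
          # job_name is assumed to be a label on this metric
          summary: "Flink job {{ $labels.job_name }} stopped reporting metrics"
```

Note that an expression like this stops matching roughly 10 minutes after the series vanishes, and it would also match for jobs that were cancelled on purpose.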


Re: [EXTERNAL] Flink and Prometheus monitoring question

Zhu Zhu
Hi Jesús, 
If your job has checkpointing enabled, you can monitor 'numberOfCompletedCheckpoints' to see whether the job is still alive and healthy.

Thanks,
Zhu Zhu
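
As a sketch of how the checkpoint suggestion above might look as an alerting rule: the full metric name flink_jobmanager_job_numberOfCompletedCheckpoints and the job_name label are assumptions about how the Prometheus reporter exposes this metric (the message only gives 'numberOfCompletedCheckpoints'), and the group name, alert name and time windows are placeholders to adjust to your checkpoint interval.

```yaml
groups:
  - name: flink-checkpoints          # hypothetical group name
    rules:
      - alert: FlinkJobCheckpointsStalled   # hypothetical alert name
        # the counter of completed checkpoints did not grow over the last 10 minutes
        expr: increase(flink_jobmanager_job_numberOfCompletedCheckpoints[10m]) == 0
        # require the condition to hold for 5 minutes before firing
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No completed checkpoints for Flink job {{ $labels.job_name }}"
```

This catches a job that is still registered but no longer making progress. If the series disappears completely, the expression returns nothing and the alert will not fire.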
