Metric counter gets reset when leader jobmanager changes in Flink native K8s HA solution

Metric counter gets reset when leader jobmanager changes in Flink native K8s HA solution

Amit Bhatia
Hi,

We have configured jobmanager HA with Flink 1.12.1 on Kubernetes, using the native K8s HA solution (https://cwiki.apache.org/confluence/display/FLINK/FLIP-144%3A+Native+Kubernetes+HA+for+Flink). Both the jobmanager and taskmanager pods are managed by Deployment controllers.

When the leader jobmanager crashes and the same jobmanager becomes leader again, everything works fine. But when the leader jobmanager crashes and a different standby jobmanager becomes leader, the metric count is reset and the request count starts again from 1. Is this the expected behaviour? Or is there a specific configuration that makes the metric continue counting, instead of resetting, when the leader jobmanager changes?

Regards,
Amit

Re: Metric counter gets reset when leader jobmanager changes in Flink native K8s HA solution

Prasanna kumar
Amit,

This is expected behaviour for a counter. If you need the total count irrespective of restarts, apply aggregate functions to the counter when querying it, for example sum(rate(counter[5m])), where the 5m window is only illustrative: https://prometheus.io/docs/prometheus/latest/querying/functions/
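
To illustrate why such query functions still give a continuous total after a reset, here is a minimal Python sketch of reset-aware counting. It is a simplification and not Prometheus's actual implementation (which also extrapolates over the query window); the function name and the sample values are hypothetical. The idea is that any drop between two consecutive samples is treated as a restart from zero.

```python
def reset_aware_increase(samples):
    """Total growth of a counter over a list of scraped values.

    A drop between consecutive samples is assumed to be a counter reset
    (for example, a new leader jobmanager starting its count from zero),
    so the whole post-reset value is counted as new growth.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev   # normal monotonic growth
        else:
            total += cur          # reset detected: counter restarted from 0
    return total


# Hypothetical scrape values: the counter falls back to 1 after a leader change.
scraped = [1, 5, 9, 1, 4, 7]
print(reset_aware_increase(scraped))  # 15.0
```

The raw counter ends at 7, but the reset-aware total is 15, so the count effectively continues across the leader change.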

Prasanna.
