Flink Metrics


Flink Metrics

Prasanna kumar
Hi flinksters,  

Scenario: We have CDC messages from our RDBMS (various tables) flowing to Kafka. Our Flink job reads the CDC messages and creates events based on certain rules.

I am using Prometheus and Grafana.

Following are the metrics that I need to calculate:

A) Number of CDC messages per table.
B) Number of events created per event type.
C) Average/P99/P95 latency (event-created timestamp minus CDC-operation timestamp).

For A and B, I created counters and am able to see the metrics flowing into Prometheus. A few questions I have here:

1) How do I create labels for counters in Flink? I did not find an easy way to do it. Right now it looks like I need to create a counter for each table and each event type; I referred to one of the community discussions [1]. Is there any other way apart from this?
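
For context, here is roughly what I am doing today (a simplified sketch; CdcMessage and the class/field names are placeholders):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Simplified sketch only: one Counter per table, registered lazily the first
// time a CDC message for that table is seen. CdcMessage is a placeholder POJO.
public class CdcCountingMapper extends RichMapFunction<CdcMessage, CdcMessage> {

    private transient Map<String, Counter> countersByTable;

    @Override
    public void open(Configuration parameters) {
        countersByTable = new HashMap<>();
    }

    @Override
    public CdcMessage map(CdcMessage msg) {
        countersByTable
            .computeIfAbsent(msg.getTable(), table ->
                getRuntimeContext()
                    .getMetricGroup()
                    .counter("cdcMessages_" + table)) // a separate counter per table
            .inc();
        return msg;
    }
}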

2) When the job gets restarted, the counters go back to 0. How can I prevent that and keep continuity?

For C, I calculated the latency in code for each event and assigned it to a histogram. A few questions I have here:

3) I read in a few blogs [2] that a histogram is the best way to measure latencies. Is there a better approach?

4) How do I create buckets for various ranges? I also read in a community email that Flink exposes histograms as summaries. I should also be able to see the latencies across timelines.


Thanks,
Prasanna.

Re: Flink Metrics

Piotr Nowojski-4
Hi,

1)
Do you want to output those metrics as Flink metrics, or output those "metrics"/counters as values to some external system (like Kafka)? The problem discussed in [1] was that the metrics (Counters) were not fitting in memory, so David suggested holding them in Flink's state and treating the measured values as regular output of the job.

You can think of the former option as a single operator that consumes your CDCs and outputs something (filtered CDCs? processed CDCs?) to Kafka, while keeping some metrics that you can access via Flink's metric system. The latter would be the same operator, but instead of a single output it would have multiple outputs, writing the "counters" for example to Kafka as well (or any other system of your choice). Both options are viable; each has its own pros and cons.
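
To make the second option concrete, a minimal sketch (class and type names are made up; CdcMessage is a placeholder): the stream is keyed by table name, the running count lives in Flink state, and it is also emitted as a side output that you could write to Kafka with a regular sink.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Sketch only: keyed by table name, the running count is kept in Flink state
// and emitted as (table, count) on a side output, i.e. the measured values
// are treated as regular output of the job.
public class CountingProcessFunction
        extends KeyedProcessFunction<String, CdcMessage, CdcMessage> {

    public static final OutputTag<Tuple2<String, Long>> COUNTS =
            new OutputTag<Tuple2<String, Long>>("table-counts") {};

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cdc-count", Long.class));
    }

    @Override
    public void processElement(CdcMessage msg, Context ctx, Collector<CdcMessage> out)
            throws Exception {
        long newCount = (count.value() == null ? 0L : count.value()) + 1;
        count.update(newCount);
        ctx.output(COUNTS, Tuple2.of(ctx.getCurrentKey(), newCount)); // counter as data
        out.collect(msg);                                             // normal output
    }
}

The side output can then be obtained with getSideOutput(CountingProcessFunction.COUNTS) on the resulting stream and sent to whatever sink you prefer.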

2) You need to persist your metrics somewhere. Why don't you use Flink's state for that purpose? Upon recovery/initialisation, you can get the recovered value from state and update/set the metric to that recovered value.
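
A minimal sketch of that idea (names are made up; Event is a placeholder type): keep the count in operator state, and on restore bring the freshly registered counter back up to the recovered value.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// Sketch only: the counter value survives restarts in operator state;
// initializeState() runs before open(), so open() can re-apply the value.
public class PersistentCounterMapper extends RichMapFunction<Event, Event>
        implements CheckpointedFunction {

    private transient Counter eventsCounter;
    private transient ListState<Long> checkpointedCount;
    private long localCount;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedCount = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("events-count", Long.class));
        if (context.isRestored()) {
            for (Long value : checkpointedCount.get()) {
                localCount += value;
            }
        }
    }

    @Override
    public void open(Configuration parameters) {
        eventsCounter = getRuntimeContext().getMetricGroup().counter("eventsCreated");
        eventsCounter.inc(localCount); // restore the metric to the recovered value
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedCount.clear();
        checkpointedCount.add(localCount);
    }

    @Override
    public Event map(Event event) {
        localCount++;
        eventsCounter.inc();
        return event;
    }
}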

3) That seems to be a question somewhat unrelated to Flink. Try searching online for how to calculate percentiles. I haven't thought about it much, but histograms or sorting all of the values seem to be the options. It is probably best to use an existing library to do that for you.
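
For example, if you register the latency as a Flink Histogram, the library can compute the quantiles for you. A minimal sketch using the Dropwizard wrapper from the flink-metrics-dropwizard module (the Event type and its timestamp accessor are made up):

import com.codahale.metrics.SlidingWindowReservoir;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.dropwizard.metrics.DropwizardHistogramWrapper;
import org.apache.flink.metrics.Histogram;

// Sketch only: a histogram backed by a Dropwizard sliding-window reservoir,
// which keeps the last 500 samples and computes quantiles over them.
public class LatencyHistogramMapper extends RichMapFunction<Event, Event> {

    private transient Histogram latencyHistogram;

    @Override
    public void open(Configuration parameters) {
        com.codahale.metrics.Histogram dropwizardHistogram =
                new com.codahale.metrics.Histogram(new SlidingWindowReservoir(500));
        latencyHistogram = getRuntimeContext()
                .getMetricGroup()
                .histogram("eventLatencyMs", new DropwizardHistogramWrapper(dropwizardHistogram));
    }

    @Override
    public Event map(Event event) {
        // latency = event-created timestamp minus CDC-operation timestamp
        latencyHistogram.update(System.currentTimeMillis() - event.getCdcTimestampMillis());
        return event;
    }
}

The Prometheus reporter exposes such a histogram as a summary with precomputed quantiles (including 0.95 and 0.99), which you can then plot over time in Grafana.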

4) Could you rephrase your question?

Best,
Piotrek
