Metrics: (1) Histogram in prometheus and (2) meter by event_time


Averell
Good day everyone,

I have a stream with two timestamps (ts1 and ts2) inside each record. My
event time is ts1. ts1 is truncated to the quarter hour (e.g. 23:30,
23:45, 00:00, ...).
I want to report two metrics:
 1. A meter which counts the number of records per value of ts1. (fig.1)
 
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Meter.png>

 2. A histogram which shows the distribution of the difference between ts1
and ts2 within each record (fig.2)
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/heatmap_histogram.png>

I'm using Prometheus with Grafana. Is it possible to do what I described?

Thanks and best regards,
Averell



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Metrics: (1) Histogram in prometheus and (2) meter by event_time

Kostas Kloudas
Hi Averell,

From what I understand of your use case, it is possible to do what you want with Flink.

If you are implementing a function, then you have access to the metric system through
the runtime context (see [1] for more information).

Some things to take into consideration:

1) Metrics are not fault-tolerant, so if you need fault-tolerance then you have to take care of that 
    (e.g. keep them in Flink’s state).

2) Are you sure you want them as metrics and not something like side-output? Metrics are meant
    more for monitoring the health of your cluster and other “system” characteristics than for
    business logic or data properties.

Cheers,
Kostas


(In reply to Averell's message of Sep 27, 2018, 8:45 AM, shown in full above.)


Re: Metrics: (1) Histogram in prometheus and (2) meter by event_time

Averell
Hi Kostas,

Yes, I want them as metrics, as they are purely for monitoring purposes.
There's no need for fault tolerance.

If I use side-output, for example for that metric no.1, I would need a
tumbling AllWindowFunction, which, as I understand, would introduce some
delay to both the normal processing flow, and to the checkpoint process.

I already tried to follow the web page that you referenced. However, I
could not work out how to get what I want.
For example, with metric no.1 - meter: org.apache.flink.metrics.Meter only
provides markEvent(), which marks an event to that Meter. There is no option
to provide the event_time, and processing_time is always used. So my graph
is spread over time like the one below.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Meter2.png>

For metric no.2 - histogram: What I can see in Prometheus is the calculated
quantile values (0.5, 0.75, 0.9, 0.99, 0.999), which tell me, for
example, that 99% of the records had ts1-ts2 <= 350s (which looks
more like a rolling average). But they don't tell me roughly what percentage of
records have a diff of 250ms, how many of 260ms, etc.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Histo2.png>

Thanks and regards,
Averell





Re: Metrics: (1) Histogram in prometheus and (2) meter by event_time

Kostas Kloudas
Hi Averell,

> On Sep 27, 2018, at 3:09 PM, Averell <[hidden email]> wrote:
>
> Hi Kostas,
>
> Yes, I want them as metrics, as they are purely for monitoring purposes.
> There's no need for fault tolerance.
>
> If I use side-output, for example for that metric no.1, I would need a
> tumbling AllWindowFunction, which, as I understand, would introduce some
> delay to both the normal processing flow, and to the checkpoint process.
>

Side-output may introduce all that, but you can always do something like:

myMainStream = …

myMainStream.myMainComputation…

myMainStream.windowAll().myMonitoringComputation…

This does not affect the main path of your computation (if this is your only concern).

> I already tried to follow the web page that you referenced. However, I
> could not work out how to get what I want.
> For example, with metric no.1 - meter: org.apache.flink.metrics.Meter only
> provides markEvent(), which marks an event to that Meter. There is no option
> to provide the event_time, and processing_time is always used. So my graph
> is spread over time like the one below.
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Meter2.png>
>

Event time is a notion of Flink and a property of your data (timestamp).

Metric systems like Prometheus take whatever you expose as a metric and attach a timestamp based on
the current wall-clock time; for them, the time an event occurred is the time they received that metric.

So, if you want event-time computations, then those should be done in Flink.
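
One way to read the point above: the per-ts1 count is computed inside the job and keyed by the event timestamp, instead of relying on the time at which the metric is scraped. Below is a minimal plain-Java sketch of that bookkeeping. It does not use Flink's API. The class EventTimeCounter, its method names, and the rule that a quarter is complete once the watermark reaches its end are illustrative assumptions (in Flink, a tumbling event-time window would typically play this role).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts records per quarter-hour of ts1 (event time). */
final class EventTimeCounter {
    private static final long QUARTER_MS = 15 * 60 * 1000L;

    // Quarter start (epoch millis) -> number of records seen for that quarter.
    private final TreeMap<Long, Long> countsByQuarter = new TreeMap<>();

    /** Adds one record; ts1 is truncated to its quarter in case it is not already. */
    void add(long ts1Ms) {
        long quarterStart = Math.floorDiv(ts1Ms, QUARTER_MS) * QUARTER_MS;
        countsByQuarter.merge(quarterStart, 1L, Long::sum);
    }

    /**
     * Removes and returns every quarter that ends at or before the watermark.
     * Assumption: ts1 marks the start of its quarter.
     */
    Map<Long, Long> flushUpTo(long watermarkMs) {
        TreeMap<Long, Long> done = new TreeMap<>();
        while (!countsByQuarter.isEmpty()
                && countsByQuarter.firstKey() + QUARTER_MS <= watermarkMs) {
            Map.Entry<Long, Long> e = countsByQuarter.pollFirstEntry();
            done.put(e.getKey(), e.getValue());
        }
        return done;
    }
}
```

Each flushed entry carries its own event-time key, so whatever receives it can plot the count against ts1 rather than against the wall-clock time of collection.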

> For metric no.2 - histogram: What I can see in Prometheus is the calculated
> quantile values (0.5, 0.75, 0.9, 0.99, 0.999), which tell me, for
> example, that 99% of the records had ts1-ts2 <= 350s (which looks
> more like a rolling average). But they don't tell me roughly what percentage of
> records have a diff of 250ms, how many of 260ms, etc.
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Histo2.png>
>

For this case, the two notions of histogram (yours and Prometheus’) are not aligned. So
what you could do is keep each “bucket” of your histogram as a separate metric and
expose it as such to Prometheus. So essentially, you are the one creating the histogram.

Metric systems do not perform any (complex) computation for you.
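
The "one metric per bucket" idea can be sketched roughly as follows, in plain Java without any Flink dependency. The class BucketedDelayHistogram, the fixed bucket width, the overflow bucket, the use of the absolute difference, and the label format are all illustrative assumptions. In a Flink function, each bucket could plausibly be backed by a Counter registered through getRuntimeContext().getMetricGroup().counter(name), but that wiring is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical helper: fixed-width buckets, each meant to be exposed as its own metric. */
final class BucketedDelayHistogram {
    private final long bucketWidthMs;
    private final AtomicLongArray counts; // last slot is the overflow bucket

    BucketedDelayHistogram(long bucketWidthMs, int bucketCount) {
        if (bucketWidthMs <= 0 || bucketCount <= 0) {
            throw new IllegalArgumentException("bucket width and count must be positive");
        }
        this.bucketWidthMs = bucketWidthMs;
        this.counts = new AtomicLongArray(bucketCount + 1);
    }

    /** Records |ts2 - ts1| for one record (absolute difference is an assumption). */
    void record(long ts1Ms, long ts2Ms) {
        long diff = Math.abs(ts2Ms - ts1Ms);
        int idx = (int) Math.min(diff / bucketWidthMs, counts.length() - 1);
        counts.incrementAndGet(idx);
    }

    /** Snapshot keyed by a label built from each bucket's inclusive lower bound. */
    Map<String, Long> snapshot() {
        Map<String, Long> out = new LinkedHashMap<>();
        int last = counts.length() - 1;
        for (int i = 0; i < last; i++) {
            out.put("ge_" + (i * bucketWidthMs) + "ms", counts.get(i));
        }
        out.put("overflow", counts.get(last));
        return out;
    }
}
```

With a width of 10 ms, a record with a difference of 250 ms and one with 260 ms land in different buckets, which is the per-value breakdown the quantile view does not give.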

Once again, I would say that these exploratory metrics about your stream fall more under the category of
analytics about your input data, rather than “metrics”, but of course feel free to disagree :)

Cheers,
Kostas



Re: Metrics: (1) Histogram in prometheus and (2) meter by event_time

Averell
Hello Kostas,

Thank you very much for the details. Also thanks for that "feel free to
disagree" (however, I don't have any desire to disagree here :) thanks a
lot)

Regarding that mainStream.windowAll, did you mean that checkpointing of the
two branches (the main one and the monitoring one) will be separate? Is
it possible to disable checkpointing for that 2nd branch?

Thanks and best regards,
Averell


