[External] Measuring Kafka consumer lag

[External] Measuring Kafka consumer lag

Padarn Wilson
Hi all,

I'm looking for some advice on how other people measure consumer lag for Kafka consumers. Recently we had an application that looked like it was performing identically to before, but all of a sudden the throughput of the job decreased dramatically. However, this was not clear from our Flink metrics; it only showed up in the lag of time vs. watermark time that our consumers were measuring.

How do people approach measuring this?

Thanks,
Padarn


By communicating with Grab Inc and/or its subsidiaries, associate companies and jointly controlled entities (“Grab Group”), you are deemed to have consented to the processing of your personal data as set out in the Privacy Notice which can be viewed at https://grab.com/privacy/

This email contains confidential information and is only for the intended recipient(s). If you are not the intended recipient(s), please do not disseminate, distribute or copy this email Please notify Grab Group immediately if you have received this by mistake and delete this email from your system. Email transmission cannot be guaranteed to be secure or error-free as any information therein could be intercepted, corrupted, lost, destroyed, delayed or incomplete, or contain viruses. Grab Group do not accept liability for any errors or omissions in the contents of this email arises as a result of email transmission. All intellectual property rights in this email and attachments therein shall remain vested in Grab Group, unless otherwise provided by law.

Re: [External] Measuring Kafka consumer lag

Padarn Wilson
Thanks Robert. 

Yes, we monitor many of Flink's internal metrics, which is why I was surprised that we were unable to notice the warning signs before our consumers notified us.

It would be nice to measure the topic offset vs. the consumer group offset of the Flink consumer.

On Tue, Jun 16, 2020 at 1:57 AM Robert Metzger <[hidden email]> wrote:
Hi Padarn,
I usually recommend the approach you described: accessing/monitoring the lag via Flink's metrics system. Sometimes it also makes sense to consider application-level metrics.
I checked YouTube for past Flink Forward talks, but I couldn't find a video. I'm sure there were users talking about best practices for monitoring Flink in the past ...

Best,
Robert

On Sun, Jun 14, 2020 at 5:47 AM Padarn Wilson <[hidden email]> wrote:
Hi all,

I'm looking for some advice on how other people measure consumer lag for Kafka consumers. Recently we had an application that looked like it was performing identically to before, but all of a sudden the throughput of the job decreased dramatically. However, this was not clear from our Flink metrics; it only showed up in the lag of time vs. watermark time that our consumers were measuring.

How do people approach measuring this?

Thanks,
Padarn



Re: [External] Measuring Kafka consumer lag

Theo
Hi Padarn,

We configure our Flink KafkaConsumer with setCommitOffsetsOnCheckpoints(true). In this case, the offsets are committed on each checkpoint for the consumer group of the application. We have external monitoring on our Kafka consumer groups (just a small script) which writes Kafka information such as startOffset, endOffset and the current committed position, for all consumer groups and for each topic and partition, to our metrics DB. I like that approach to monitoring because it is largely independent of Flink and thus reliable for detecting problems when Flink is too slow. Of course, we also rely heavily on Flink's internal metrics, but for the first check of "is everything ok?", we look at the Kafka topic metrics and see "there are XX events coming in and there is no lag (backpressure) => all fine".
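
As a rough illustration of the lag calculation such a script might perform, here is a minimal Python sketch. It is not the actual script: the names PartitionOffsets, partition_lag and group_lag, the topic name "events" and the sample numbers are all hypothetical, and the step that fetches the offsets from Kafka and writes the result to the metrics DB is left out.

```python
from typing import Dict, NamedTuple, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


class PartitionOffsets(NamedTuple):
    """Hypothetical container for the three values the script records."""
    start_offset: int      # earliest offset still retained in the partition
    end_offset: int        # offset the next produced record will receive
    committed_offset: int  # position committed by the group (next record to read)


def partition_lag(offsets: PartitionOffsets) -> int:
    """Number of records the group still has to read in one partition."""
    # If the committed position has fallen behind retention, only records
    # from start_offset onwards can still be read.
    position = max(offsets.committed_offset, offsets.start_offset)
    return max(offsets.end_offset - position, 0)


def group_lag(snapshot: Dict[TopicPartition, PartitionOffsets]) -> Dict[str, int]:
    """Sum the per-partition lag for each topic of one consumer group."""
    totals: Dict[str, int] = {}
    for (topic, _partition), offsets in snapshot.items():
        totals[topic] = totals.get(topic, 0) + partition_lag(offsets)
    return totals


# Made-up snapshot for one consumer group; a real script would fill this
# from Kafka before writing the totals to the metrics database.
snapshot = {
    ("events", 0): PartitionOffsets(start_offset=0, end_offset=1500, committed_offset=1400),
    ("events", 1): PartitionOffsets(start_offset=200, end_offset=900, committed_offset=900),
}
print(group_lag(snapshot))  # {'events': 100}
```

Because the offsets are only committed on checkpoints, a lag computed this way would be expected to rise between checkpoints and drop when one completes, so any alert threshold probably needs to allow for the checkpoint interval.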

Best regards
Theo


From: "Padarn Wilson" <[hidden email]>
To: "Robert Metzger" <[hidden email]>, "user" <[hidden email]>
Sent: Tuesday, June 16, 2020 02:52:16
Subject: Re: [External] Measuring Kafka consumer lag

Thanks Robert. 
Yes, we monitor many of Flink's internal metrics, which is why I was surprised that we were unable to notice the warning signs before our consumers notified us.

It would be nice to measure the topic offset vs. the consumer group offset of the Flink consumer.

On Tue, Jun 16, 2020 at 1:57 AM Robert Metzger <[hidden email]> wrote:
Hi Padarn,
I usually recommend the approach you described: accessing/monitoring the lag via Flink's metrics system. Sometimes it also makes sense to consider application-level metrics.
I checked YouTube for past Flink Forward talks, but I couldn't find a video. I'm sure there were users talking about best practices for monitoring Flink in the past ...

Best,
Robert

On Sun, Jun 14, 2020 at 5:47 AM Padarn Wilson <[hidden email]> wrote:
Hi all,
I'm looking for some advice on how other people measure consumer lag for Kafka consumers. Recently we had an application that looked like it was performing identically to before, but all of a sudden the throughput of the job decreased dramatically. However, this was not clear from our Flink metrics; it only showed up in the lag of time vs. watermark time that our consumers were measuring.

How do people approach measuring this?

Thanks,
Padarn

