Trying to understand KafkaConsumer_records_lag_max


Trying to understand KafkaConsumer_records_lag_max

Julio Biason
Hi guys,

We are on the final stages of moving our Flink pipeline from staging to production, but I just found something kinda weird:

We are graphing some Flink metrics, like flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max. If I got this right, that's "Kafka head offset - Flink consumer offset", i.e., the number of records Flink still needs to process to reach the most recent record in the partition. Is that right?
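For what it's worth, here's a minimal sketch of how I understand the metric; the function names (records_lag, records_lag_max) are my own illustration, not the actual Kafka client internals:

```python
# Illustrative sketch of the "records lag" idea: lag for one partition is the
# head (latest) offset minus the consumer's current position, and
# records-lag-max reports the largest such lag over the assigned partitions.

def records_lag(head_offset: int, consumer_offset: int) -> int:
    """Per-partition lag; clamped at 0 if the consumer is at the head."""
    return max(head_offset - consumer_offset, 0)

def records_lag_max(lags_by_partition: dict) -> int:
    """Maximum lag across all partitions assigned to this consumer."""
    return max(lags_by_partition.values(), default=0)

print(records_lag(100, 40))                       # -> 60
print(records_lag_max({0: 120, 1: 45, 2: 300}))   # -> 300
```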

If that's the case, I saw another weird thing: at some points, this lag falls back to 0 and then slowly climbs back up (remember, this is a staging environment, not production, so we are using smaller machines with few cores [2] and low memory [8 GB]) -- Grafana graph attached for reference. I don't see any checkpoint errors or taskmanager failures, so I don't think it simply dropped everything and started over.

Any ideas what's going on here?

--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554

[Attachment: Screenshot 2018-04-13 15.00.12.png (90K)]

Re: Trying to understand KafkaConsumer_records_lag_max

Tzu-Li (Gordon) Tai
Hi Julio,

I'm not really sure, but do you think it is possible that there could be
some hard data retention setting for your Kafka topics in the staging
environment?
As in, at some point in time and maybe periodically, all data in the Kafka
topics are dropped and therefore the consumers effectively jump directly
back to the head again.

Cheers,
Gordon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Trying to understand KafkaConsumer_records_lag_max

Julio Biason
Hi Gordon (and list),

Yes, that's probably what's going on. I got another message from 徐骁 who told me almost the same thing -- something I completely forgot (he also mentioned auto.offset.reset, which could be forcing Flink to keep reading from the head of Kafka instead of going back to read older entries).
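To make sure I understood the explanation, I sketched it out; this is just an illustrative simulation of the offset-reset behavior (the function and its parameters are my own, not the actual broker or client code):

```python
# Illustration of why lag can snap to 0: if retention deletes records older
# than the consumer's saved position, the position is out of range, and with
# auto.offset.reset=latest the consumer resets to the head of the partition.

def resolve_position(position: int, log_start: int, log_end: int,
                     auto_offset_reset: str = "latest") -> int:
    """Return the effective consume position after an out-of-range check."""
    if position < log_start:  # saved offset fell out of the retained range
        return log_end if auto_offset_reset == "latest" else log_start
    return position

# Consumer was at offset 500, but retention dropped everything below 800;
# the head of the partition is at 1000.
new_pos = resolve_position(500, log_start=800, log_end=1000)
print(1000 - new_pos)  # -> 0: observed lag collapses after the reset
```

With auto.offset.reset=earliest the consumer would instead land at the start of the retained range, and lag would drop to the size of whatever data survived retention rather than to zero.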

Now I need to figure out how to make my pipeline consume entries faster than (or at least on par with) the speed at which they arrive in Kafka -- but that's a discussion for another email. ;)

On Mon, Apr 16, 2018 at 1:29 AM, Tzu-Li (Gordon) Tai <[hidden email]> wrote:
Hi Julio,

I'm not really sure, but do you think it is possible that there could be
some hard data retention setting for your Kafka topics in the staging
environment?
As in, at some point in time and maybe periodically, all data in the Kafka
topics are dropped and therefore the consumers effectively jump directly
back to the head again.

Cheers,
Gordon



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/



--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554