flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2


flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Julio Biason
Hey guys,

I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus monitoring is not getting the current Kafka lag.

After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar into lib/, I got the metrics back in Prometheus, but the most important one, flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.

Did I miss something?

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Tzu-Li (Gordon) Tai
Hi,

Which Kafka version are you using?

AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series would be FLINK-8419 [1].
The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded from the internally used Kafka client, so nothing should have been affected.
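For context, a rough sketch of what "forwarded" means here. This is not the actual connector code; the names are illustrative, and which metric group the gauges land on (a "KafkaConsumer" subgroup versus the operator group directly) is precisely the scope question that comes up later in this thread:

    import java.util.Map;

    import org.apache.flink.metrics.Gauge;
    import org.apache.flink.metrics.MetricGroup;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.Metric;
    import org.apache.kafka.common.MetricName;

    public class KafkaMetricForwardingSketch {

        // Wrap every Kafka-client metric (records-lag-max included) in a Flink
        // Gauge and register it, so the configured reporter (Prometheus here)
        // exposes it without the connector computing anything itself.
        static void forwardKafkaMetrics(KafkaConsumer<?, ?> consumer, MetricGroup group) {
            for (Map.Entry<MetricName, ? extends Metric> entry : consumer.metrics().entrySet()) {
                final Metric kafkaMetric = entry.getValue();
                group.gauge(entry.getKey().name(), (Gauge<Double>) kafkaMetric::value);
            }
        }
    }

Since the connector only mirrors whatever the client reports, a -Inf here would be the value the Kafka client itself is producing.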

Do you see other metrics under the pattern of ‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics should still follow this pattern.
If not, could you find the ‘records_lag_max’ metric (or any other Kafka-shipped metrics [2]) under the user scope ‘KafkaConsumer’?

The above should provide more insight into what may be wrong here.

- Gordon


Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Julio Biason
Hi Gordon,

We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.
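For reference, a minimal sketch of the kind of source setup in play here (the topic name, group id and broker address are made up):

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

    public class KafkaSourceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(12); // the job's parallelism (12, per the setup described later in the thread)

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka-1:9092"); // hypothetical
            props.setProperty("group.id", "my-flink-job");          // hypothetical

            // flink-connector-kafka-0.10 source against Kafka 0.10.1.1
            FlinkKafkaConsumer010<String> source =
                new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);

            env.addSource(source).print();
            env.execute("kafka-lag-metrics-sketch");
        }
    }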

There are a bunch of flink_taskmanager_job_task_operator_* metrics, including some about the committed offset for each partition. It seems I have 4 different records_lag_max with different attempt_id, though, 3 with -Inf and 1 with a value -- which means I'll need to understand Prometheus a bit better to extract this properly.

I was also checking our Grafana and the metric we were using was "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the attempt thingy).

On the "KafkaConsumer" front, but it only has the "commited_offset" for each partition.

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Julio Biason
Just to add some more info, here is the data I have on Prometheus (with some names redacted):

flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="0",task_attempt_id="fa104111e1f493bbec6f4b2ce44ec1da",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}        83529
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="3",task_attempt_id="4c5d8395882fb1ad26bdd6fd7f4e789c",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}        -Inf
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="6",task_attempt_id="f05e3171c446c19c7b928eeffe0fa52f",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}        -Inf
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="9",task_attempt_id="9533c5fa9fafadf4878ce90a08f83213",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}        84096
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="1",task_attempt_id="7ea45523850f5d08bab719418321e410",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}        83867
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="10",task_attempt_id="cf6c2349ccf818f6870fdf0296be121b",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}       83829
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="4",task_attempt_id="b39c5366b90d74e57d058b64e9e08e56",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}        -Inf
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="7",task_attempt_id="db563e7e360227585cff9fa3d0035b0d",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}        -Inf
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="11",task_attempt_id="4e9231f9187b0dffc728d8cd77cfef9e",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}       83730
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="2",task_attempt_id="e920a1672b0a31c2d186e3f6fee38bed",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}        -Inf
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="5",task_attempt_id="0e22fd213905d4da222e3651e7007106",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}        83472
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="8",task_attempt_id="f36fe63b0688a821f5abf685551c47fa",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}        83842

What we have are 3 servers running with 4 slots each. Our Kafka has 12 partitions and the job is running with a parallelism of 12. In this setup, I'd expect each slot to grab one partition and process it, giving each a different lag corresponding to its own partition. But it seems some slots are stubbornly refusing to either grab a partition or update their status. It doesn't seem (from the perspective of someone who doesn't know the code) to be related to TaskManagers sharing the same Kafka connection, as 004 is reporting lag for 3 partitions while 002 and 003 are reporting it for just 2.
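For intuition only -- this is not the connector's actual assignment code, whose details differ, but it shows why 12 partitions at parallelism 12 should leave every subtask with exactly one partition:

    public class PartitionAssignmentSketch {

        // Simplified, modulo-style round-robin: partitions 0..n-1 spread over
        // subtasks 0..parallelism-1. With 12 partitions and parallelism 12,
        // each subtask gets exactly one partition.
        static int targetSubtask(int partition, int parallelism) {
            return partition % parallelism;
        }

        public static void main(String[] args) {
            int partitions = 12, parallelism = 12;
            for (int p = 0; p < partitions; p++) {
                System.out.printf("partition %d -> subtask %d%n", p, targetSubtask(p, parallelism));
            }
        }
    }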

And I was wrong about the attempt_id: It's not what's messing with my Prometheus query, it's some slots reporting -Inf on their partitions.

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Tzu-Li (Gordon) Tai
In reply to this post by Julio Biason
Hi,

Thanks for the extra information. So, there seem to be two separate issues here. I’ll go through them one by one.

> I was also checking our Grafana and the metric we were using was "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the attempt thingy).

After looking at the code changes in FLINK-8419, this unfortunately is an accidental “break” in the scope of the metric.
In 1.4.0, the Kafka-shipped metrics were exposed under the “KafkaConsumer” metrics group. After FLINK-8419, this was changed, as you observed.
In 1.5.0, however, I think the metrics are exposed under both patterns.

Now, regarding the fact that some subtasks are returning -Inf for ‘records-lag-max’:
If I understood the metric semantics correctly, this metric represents the "max record lag across **partitions subscribed by a Kafka consumer client**".
So, the only possibilities I can think of are that either the subtask does not have any partitions assigned to it, or there is simply a bug in the Kafka client returning this value.
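For intuition on where -Inf can come from (this is an illustration, not the Kafka client's actual code): if records-lag-max behaves like a standard windowed maximum, then a maximum taken over zero samples naturally degenerates to negative infinity, which is what a consumer that fetched nothing in the window, for instance because it owns no partitions, would report.

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    public class EmptyWindowMaxSketch {

        // A windowed max starts at -Infinity and only grows when samples arrive.
        static double windowedMax(List<Double> samplesInWindow) {
            double max = Double.NEGATIVE_INFINITY;
            for (double sample : samplesInWindow) {
                max = Math.max(max, sample);
            }
            return max;
        }

        public static void main(String[] args) {
            System.out.println(windowedMax(Arrays.asList(83529.0, 84096.0))); // 84096.0
            System.out.println(windowedMax(Collections.emptyList()));         // -Infinity
        }
    }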

Could you verify that all subtasks have a partition assigned to them? That should be possible by checking the job status in the Web UI and observing the numRecordsOut value for each source subtask.

Cheers,
Gordon



Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Julio Biason
Hey Gordon,

The job restarted somewhere in the middle of the night (I haven't checked why yet) and now I have this weird status: the first TaskManager has only one valid lag, the second has 2 and the third has none.

I don't know if I can see the partition assignment in the logs, but all "numRecordsOut" values are increasing over time (screenshots of the graphs attached).

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554

Attachments: Screenshot 2018-06-14 11.28.45.png (80K), Screenshot 2018-06-14 11.28.39.png (84K), Screenshot 2018-06-14 11.28.33.png (81K), Screenshot 2018-06-14 11.28.27.png (78K)

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Julio Biason
Hey Gordon,

(Reviving this long thread) I think I found part of the problem: it seems the metric is capturing the lag from time to time and resetting the value in-between. I managed to replicate this by attaching a SQL sink (JDBCOutputFormat) connecting to an external database -- something that took about 2 minutes to write 500 records.

I opened the ticket https://issues.apache.org/jira/browse/FLINK-9998 with a bit more information ('cause I completely forgot to open a ticket about this a month ago).
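In case it helps anyone reproduce this, a minimal sketch of a deliberately slow sink (a stand-in for the actual JDBCOutputFormat setup, not the code used here):

    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    // Stand-in for the slow external-database sink: roughly 2 minutes per 500
    // records, i.e. ~240 ms per record, which is enough to back-pressure the
    // Kafka source between polls.
    public class SlowSinkSketch<T> implements SinkFunction<T> {

        @Override
        public void invoke(T value) throws Exception {
            Thread.sleep(240);
        }
    }

Hooking something like this up after the Kafka source and watching flink_taskmanager_job_task_operator_records_lag_max should show the same pattern of values flapping to -Inf between polls, if the sample-window explanation holds.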

--
Julio Biason, Software Engineer
AZION | Deliver. Accelerate. Protect.
Office: +55 51 3083 8101 | Mobile: +55 51 99907 0554

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

Tony Wei
Hi Julio,

As Gordon said, the `records_lag_max` metric is a Kafka-shipped metric [1].
I also found this thread [2] in the Kafka mailing list. It seems this is by
design inside Kafka, so I think there is nothing we can do in the Flink Kafka connector.

BTW, the Kafka documentation [1] says `records_lag_max` is the maximum lag
in terms of number of records for any partition in this "window". I'm not sure
what this "window" means or whether it is configurable. If it is configurable, you
can pass the config entry directly through the Flink Kafka connector's consumer
properties to set it on the Kafka consumer.
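For example, something like the following would pass such a setting straight through to the consumer. The topic, group id and brokers are made up, and whether the standard client metric settings below actually govern the records-lag-max window is an assumption to verify, not something the Kafka docs spell out:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

    public class KafkaMetricsWindowSketch {

        static FlinkKafkaConsumer010<String> buildSource() {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka-1:9092"); // hypothetical
            props.setProperty("group.id", "my-flink-job");          // hypothetical

            // Standard Kafka client metric settings, forwarded untouched by the
            // connector; whether they widen the records-lag-max window is an
            // assumption, not a documented guarantee.
            props.setProperty("metrics.sample.window.ms", "60000"); // default is 30000
            props.setProperty("metrics.num.samples", "2");

            return new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);
        }
    }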

Best,
Tony Wei

