Get EOF from PrometheusReporter in JM


Tony Wei
Hi, 

I built the Prometheus reporter package from this PR, https://github.com/apache/flink/pull/4586, and used it on Flink 1.3.2 to record all the default metrics as well as those from `FlinkKafkaConsumer`.

At first, everything was fine. I could get the TM metrics from Prometheus, and they matched what I saw in the Flink Web UI.
However, when I turned to the JM, Prometheus gave me this error: Get http://localhost:9249/metrics: EOF.
I checked the JM log and found nothing in it. There was no error message, and port 9249 was still open.

To figure out what happened, I created another cluster and found that Prometheus could connect to the Flink cluster as long as no job was running. After the JM triggered or completed the first checkpoint, Prometheus started getting ERR_EMPTY_RESPONSE from the JM, but not from the TM. There was still no error in the log file, and port 9249 was still open.
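
The symptom described here (the port accepts connections, yet the scrape ends without a response) can be told apart from a closed port with a small probe. What follows is only a minimal sketch and not part of Flink or the reporter: the class `MetricsEndpointProbe`, its methods, the result labels, and the local stub server are hypothetical names introduced for illustration. The stub merely stands in for a real endpoint such as http://localhost:9249/metrics so the example runs without a cluster.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetricsEndpointProbe {

    /** Issues one GET and classifies the outcome (labels are made up for this sketch). */
    public static String probe(String url) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            // Reading the status line is where a missing response surfaces.
            return "HTTP " + conn.getResponseCode();
        } catch (ConnectException e) {
            // Nothing accepted the connection on that port.
            return "NOT_LISTENING: " + e.getMessage();
        } catch (IOException e) {
            // Any other I/O failure, e.g. the stream ended before a status line, or a timeout.
            return "NO_VALID_RESPONSE: " + e;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    /** Starts a throwaway local endpoint on a free port; it only stands in for a real JM. */
    public static HttpServer startStub() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/metrics", exchange -> {
                byte[] body = "stub_metric 1\n".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer stub = startStub();
        String url = "http://127.0.0.1:" + stub.getAddress().getPort() + "/metrics";
        System.out.println(probe(url)); // stub is up and answering
        stub.stop(0);
        System.out.println(probe(url)); // stub is gone, so the port should be closed
    }
}
```

Pointed at the JM endpoint in the situation described, such a probe would presumably report `NO_VALID_RESPONSE` rather than `NOT_LISTENING`, since the socket is open but no HTTP response arrives.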

I was wondering where the error occurs: in Flink or in the Prometheus reporter?
Or is it incorrect to use the Prometheus reporter on Flink 1.3.2? Thank you.

Best Regards,
Tony Wei
Re: Get EOF from PrometheusReporter in JM

Chesnay Schepler
The Prometheus reporter should work with 1.3.2.

Does this also occur with the reporter that currently exists in 1.4 (to rule out new bugs from the PR)?

To investigate this further, please set the logging level to WARN and try again, as all errors in the metric system are logged on that level.

(In reply to Tony Wei's post of 22.09.2017 10:33.)

Re: Get EOF from PrometheusReporter in JM

Tony Wei
Hi Chesnay,

I haven't tried it on 1.4, so I don't know whether this also occurs there.
As for logging, the level was already set to INFO, but there weren't any errors or warnings in the log file either.

Best Regards,
Tony Wei

(In reply to Chesnay Schepler's message of 2017-09-22 22:07 GMT+08:00.)

Re: Get EOF from PrometheusReporter in JM

Tony Wei
Hi Chesnay,

I built another Flink cluster on version 1.4, set the log level to DEBUG, and found that the root cause might be this exception: java.lang.NullPointerException: Value returned by gauge lastCheckpointExternalPath was null.

I updated `CheckpointStatsTracker` to ignore the external path when it is null, and the exception no longer occurred. The Prometheus reporter works as well.
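
To illustrate the failure mode and one way around it, here is a minimal, self-contained sketch. It is not the code from `CheckpointStatsTracker` or from the reporter: the `Gauge` interface, the `CheckpointStats` holder, the `render` method, and the "n/a" placeholder are all stand-ins assumed for illustration. The stand-in reporter simply rejects null gauge values with a message shaped like the one in the exception above, and the null-safe gauge substitutes a placeholder instead of returning null, which is only one possible reading of "ignore the external path when it is null".

```java
import java.util.Objects;

public class NullSafeGaugeSketch {

    /** Stand-in for a metrics gauge; not Flink's actual interface. */
    public interface Gauge<T> {
        T getValue();
    }

    /** Hypothetical holder: the path stays null when the checkpoint was not externalized. */
    public static final class CheckpointStats {
        final String externalPath;

        public CheckpointStats(String externalPath) {
            this.externalPath = externalPath;
        }
    }

    /** Naive gauge: passes a null path straight through to whoever reads it. */
    public static Gauge<String> naiveGauge(CheckpointStats stats) {
        return () -> stats.externalPath;
    }

    /** Null-safe gauge: returns a placeholder (assumed here to be "n/a") instead of null. */
    public static Gauge<String> nullSafeGauge(CheckpointStats stats) {
        return () -> stats.externalPath != null ? stats.externalPath : "n/a";
    }

    /** Stand-in for a reporter that refuses null gauge values while rendering a scrape. */
    public static String render(String name, Gauge<?> gauge) {
        Object value = Objects.requireNonNull(
                gauge.getValue(), "Value returned by gauge " + name + " was null");
        return name + "=" + value;
    }

    public static void main(String[] args) {
        CheckpointStats notExternalized = new CheckpointStats(null);
        String name = "lastCheckpointExternalPath";

        // The null-safe gauge renders without trouble.
        System.out.println(render(name, nullSafeGauge(notExternalized)));

        // The naive gauge makes the stand-in reporter fail for the whole scrape.
        try {
            render(name, naiveGauge(notExternalized));
        } catch (NullPointerException e) {
            System.out.println("scrape failed: " + e.getMessage());
        }
    }
}
```

The sketch also suggests why only the JM was affected once a checkpoint had run: a single gauge returning null is enough to break the rendering of every metric served in that response.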

I have created a Jira issue for it: https://issues.apache.org/jira/browse/FLINK-7675, and I will submit the PR once my repository passes Travis CI.

Best Regards,
Tony Wei

 

(In reply to Tony Wei's message of 2017-09-22 22:20 GMT+08:00.)

Re: Get EOF from PrometheusReporter in JM

Maximilian Bode
Hi Tony,

Thanks for troubleshooting this. I have added a commit to https://github.com/apache/flink/pull/4586 that should enable you to use the reporter with 1.3.2 as well.

Best regards,
Max

(In reply to Tony Wei's message of 23 September 2017, 13:11.)

Re: Get EOF from PrometheusReporter in JM

Tony Wei
Hi Max,

Good to know. Thanks very much.

Best Regards,
Tony Wei

(In reply to Maximilian Bode's message of 2017-10-24 13:52 GMT+08:00.)