Flink Datadog Timeout

Claude Murad

Hello, 

I have a Flink jobmanager and taskmanagers deployed in a Kubernetes cluster. I integrated them with Datadog by adding the following to flink-conf.yaml:

metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <DD_API_KEY>

However, I'm seeing random timeouts in the log, and I don't know why they occur or how to fix them. Please see the attached file, which shows the error.


Thanks





Attachment: FlinkDatadogTimeout.txt (3K)

Re: Flink Datadog Timeout

Chesnay Schepler
The reported exception looks quite similar to the one in this thread, which was supposedly caused by Datadog rate limits, although I don't think that was ever thoroughly investigated.
(Bear in mind that each container has its own reporter; with the default reporting interval of 10 seconds, you quickly reach a fairly high rate of reports per second.)
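
If rate limiting is the cause, one thing to try is a longer reporting interval. As a rough illustration, a hypothetical cluster of 30 containers reporting every 10 seconds sends about 3 reports per second in total; at 60 seconds that drops to about 0.5. The fragment below is a minimal sketch that reuses the reporter name dghttp from the original post; the 60-second value is an assumption chosen only for illustration, not a setting recommended in this thread.

```yaml
# flink-conf.yaml (sketch; the interval value is an illustrative assumption)
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <DD_API_KEY>
# Report every 60 seconds instead of the default 10 seconds,
# so each container sends fewer requests to Datadog.
metrics.reporter.dghttp.interval: 60 SECONDS
```

The trade-off is coarser metric resolution in Datadog, so the interval is worth raising only as far as needed to see whether the timeouts subside.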

Alternatively, it could just be plain connectivity issues.

However, if the issues do not persist for long, no metrics should be lost, so you may be able to ignore them.


(In reply to Claude M's message of 2/2/2021 7:31 PM.)