Datadog reporter timeout & OOM issue

Datadog reporter timeout & OOM issue

Xingcan Cui
Hi all,

Recently, I tried to use the Datadog reporter to collect some user-defined metrics. Sometimes, when we reach traffic peaks (which are also peaks for the metrics), the HTTP client throws the following exception:

```
[OkHttp https://app.datadoghq.com/...] WARN  org.apache.flink.metrics.datadog.DatadogHttpClient  - Failed sending request to Datadog
java.net.SocketTimeoutException: timeout
at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:593)
at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:601)
at okhttp3.internal.http2.Http2Stream.takeResponseHeaders(Http2Stream.java:146)
at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:120)
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```

I guess this may be caused by rate limiting on the Datadog server, since too many HTTP requests look like a kind of "attack". The real problem is that after these exceptions are thrown, the JVM heap of the TaskManager starts to grow and finally causes an OOM. I'm wondering whether this is caused by metrics accumulation, i.e., for some reason the client can't reconnect to the Datadog server to send the metrics, so the metrics data is buffered in memory until it causes the OOM.
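The suspected failure mode can be sketched as a toy model (hypothetical code, not the actual DatadogHttpClient implementation): if every reporting interval enqueues a serialized payload and failed sends never drain the queue, retained memory grows linearly for as long as the outage lasts.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class UnboundedReporterModel {
    // Serialized metric payloads waiting to be sent; nothing bounds this queue.
    private final Deque<byte[]> pending = new ArrayDeque<>();

    // Called once per reporting interval. On success everything is flushed;
    // on failure the payload stays queued, so memory grows with each interval.
    public void report(byte[] payload, boolean sendSucceeded) {
        pending.addLast(payload);
        if (sendSucceeded) {
            pending.clear();
        }
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        UnboundedReporterModel reporter = new UnboundedReporterModel();
        // Simulate 1000 reporting intervals during which every send fails.
        for (int i = 0; i < 1000; i++) {
            reporter.report(new byte[1024], false);
        }
        // All 1000 payloads (~1 MB here) are still retained on the heap.
        System.out.println(reporter.pendingCount() + " payloads retained");
    }
}
```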

I'm running Flink 1.11.2 on EMR-6.2.0 with flink-metrics-datadog-1.11.2.jar.

Thanks,
Xingcan

Re: Datadog reporter timeout & OOM issue

Juha Mynttinen
Hey,

A few months back, I had a very similar problem with Datadog when I tried to do a proof of concept using it with Flink. I had quite a lot of user-defined metrics, got similar exceptions, and the metrics didn't end up in Datadog. Without any deeper analysis, I assumed Datadog was throttling the incoming traffic.

Back then it was also difficult to configure the Datadog region (EU/US). If I remember correctly, the region was more or less hardcoded to US. That seems to be fixed now; there's the parameter metrics.reporter.dghttp.dataCenter to define the region.
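For reference, selecting the region is a reporter option in flink-conf.yaml; a minimal fragment might look like this (the API key value is a placeholder, and the exact keys should be checked against the docs for your Flink version):

```yaml
metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
metrics.reporter.dghttp.apikey: <your-datadog-api-key>
# US (the default) or EU, depending on which Datadog site hosts your account
metrics.reporter.dghttp.dataCenter: EU
```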

Regards,
Juha


Re: Datadog reporter timeout & OOM issue

Chesnay Schepler
In reply to this post by Xingcan Cui
Yes, I can see how the memory issue could occur.

However, it should be limited to buffering 64 requests; this is the default limit that okhttp imposes on concurrent calls.
Maybe lowering this value already does the trick.
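The effect of such a cap can be sketched with a plain-JDK bounded buffer (hypothetical code, not the reporter's actual implementation; in okhttp itself the corresponding knob is Dispatcher#setMaxRequests). Drop-oldest is just one possible bounding strategy: once the cap is hit, the oldest pending report is evicted instead of retained, so heap usage stays flat however long the sends keep failing.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedMetricBuffer {
    private final int capacity;
    private final Deque<String> pending = new ArrayDeque<>();
    private long dropped = 0;

    public BoundedMetricBuffer(int capacity) {
        this.capacity = capacity;
    }

    // Queue a serialized report; once full, evict the oldest entry instead of
    // growing, so the buffer never exceeds 'capacity' whatever the failure rate.
    public void offer(String report) {
        if (pending.size() == capacity) {
            pending.pollFirst();
            dropped++;
        }
        pending.addLast(report);
    }

    public int size() {
        return pending.size();
    }

    public long droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        // 64 mirrors the okhttp default concurrent-call limit mentioned above.
        BoundedMetricBuffer buffer = new BoundedMetricBuffer(64);
        for (int i = 0; i < 1000; i++) {
            buffer.offer("metric-report-" + i); // simulate 1000 failed sends
        }
        System.out.println(buffer.size() + " buffered, " + buffer.droppedCount() + " dropped");
    }
}
```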


Re: Datadog reporter timeout & OOM issue

Chesnay Schepler
(Setting this field is currently not possible for Flink users; it is something I will investigate.)



Re: Datadog reporter timeout & OOM issue

Xingcan Cui
Hi Juha and Chesnay,

I do appreciate your prompt responses! I'll also continue to investigate this issue.

Best,
Xingcan
