DataDog and Flink

DataDog and Flink

Vishal Santoshi
Hello folks, 
                  This is quite strange. We see a TM stop reporting metrics to DataDog. In the logs from that specific TM, every DataDog dispatch times out with java.net.SocketTimeoutException: timeout, and that repeats on every dispatch, which appears to happen on a 10-second cadence per container. The TM itself keeps humming along, so it does not seem to be under memory/CPU distress. The exception is also not transient: reporting just stops dead, and from then on every dispatch times out.

Looking at the SLAs published by DataDog, a throttling condition should not surface as a SocketTimeout, unless of course the error reporting for that specific case is off. This therefore looks very much like a network issue, which is odd because other TMs on the same network hum along, sending their metrics successfully. The other possibility is simply the volume of metrics: the current volume for this TM may be prohibitive. Either way, the exception itself is not very helpful.

Any ideas from folks who have used the DataDog reporter with Flink? Even pointers to best practices would be a sufficient beginning.

Regards.

Re: DataDog and Flink

Arvid Heise-4
Hi Vishal,

I have no experience with the Flink+DataDog setup, but I have worked a bit with DataDog before.
I agree that the timeout does not look like a rate limit, and it would be odd for the other TMs with a similar rate to still get through. So I'd suspect network issues.
Can you log into that TM's machine and check manually how the system behaves?
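For instance, a rough connectivity probe from the affected TM host could look like this (just a sketch; the series URL is the one the reporter itself posts to, and DD_API_KEY is a placeholder -- even a 4xx response would at least show that the connection and TLS handshake complete well inside the reporter's timeout):

  # Time a small POST against the Datadog series endpoint the reporter uses.
  # DD_API_KEY is a placeholder; an empty series payload is enough for a latency check.
  curl -sS -o /dev/null \
    -w 'http_code=%{http_code} connect=%{time_connect}s total=%{time_total}s\n' \
    --max-time 3 \
    -H 'Content-Type: application/json' \
    -X POST "https://app.datadoghq.com/api/v1/series?api_key=${DD_API_KEY}" \
    -d '{"series": []}'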


Re: DataDog and Flink

Vishal Santoshi

If we look at the reporter code, the metrics are divided into chunks up to a max size and enqueued. The request has a 3-second read/connect/write timeout, which IMHO should be configurable (or is it?). Given that the number of metrics exposed by a Flink cluster is pretty high (and the metric names plus tags are long), it may make sense to limit the number of metrics in a single chunk, and thereby the size of a single chunk. There is a configuration that allows reducing the metrics per chunk:

metrics.reporter.dghttp.maxMetricsPerRequest: 2000

We could decrease this to 1500 (1500 is arbitrary, not based on any empirical reasoning) and see if that stabilizes the dispatch. The number of requests will inevitably grow and we may hit the throttle, but then at least we would see the throttling exception rather than timeouts, which are generally less intuitive.
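For reference, the reporter block in flink-conf.yaml would then look roughly like this (a sketch only; the apikey value is a placeholder, and the key names are the ones documented for the Datadog HTTP reporter):

  # Datadog HTTP reporter -- sketch, apikey is a placeholder
  metrics.reporter.dghttp.class: org.apache.flink.metrics.datadog.DatadogHttpReporter
  metrics.reporter.dghttp.apikey: <api-key>
  # lowered from the default of 2000 to shrink each request
  metrics.reporter.dghttp.maxMetricsPerRequest: 1500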

Any thoughts?

Re: DataDog and Flink

Vishal Santoshi
I guess there is a bigger issue here. We dropped the property to 500. We also realized that this failure happened on a TM that had one specific job running on it. What was good (but surprising) was that the exception became the more protocol-specific 413 (as in, the chunk is greater than some size limit DD enforces on a single request):

Failed to send request to Datadog (response was Response{protocol=h2, code=413, message=, url=https://app.datadoghq.com/api/v1/series?api_key=**********})

which implies that the socket timeout was masking this issue. With 2000 metrics per request the payload was simply so large that DD was unable to process it in time (or the upload itself was slow, etc.). We could go lower still, but that makes less sense. We could instead play with https://ci.apache.org/projects/flink/flink-docs-stable/ops/metrics.html#system-scope to reduce the size of the tags (or keys).
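For example, something along these lines (a sketch of the kind of trimming the linked system-scope docs describe; the commented line is the documented default, and whether this actually shrinks what the Datadog reporter ships is something we would have to verify):

  # documented default for task-level metrics:
  # metrics.scope.task: <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
  # trimmed variant dropping the tm_id component
  metrics.scope.task: <host>.taskmanager.<job_name>.<task_name>.<subtask_index>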

Re: DataDog and Flink

Vishal Santoshi
That said, is there a way to get a dump of all metrics exposed by a TM? I was searching for it; I bet we could get it via a ServiceMonitor scrape on k8s, but I am missing a way to hit a TM directly and dump all the metrics that are pushed.

Thanks and regards.


Re: DataDog and Flink

Matthias
Hi Vishal,
what about the TM metrics REST endpoint [1]? Is this something you could use to get all the metrics for a specific TaskManager? Or are you looking for something else?

Best,
Matthias



Re: DataDog and Flink

Arvid Heise-4
Hi Vishal,

The REST API is the most direct way to go through all metrics, as Matthias pointed out. Additionally, you could add a JMX reporter and log into the machines to check.
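A minimal JMX reporter entry in flink-conf.yaml could look roughly like this (a sketch; depending on the Flink version the reporter is configured via class or factory.class, and the port range here is arbitrary):

  metrics.reporter.jmx.factory.class: org.apache.flink.metrics.jmx.JMXReporterFactory
  # arbitrary port range; one port per JVM is picked from it
  metrics.reporter.jmx.port: 8789-8799

You could then attach jconsole or VisualVM to the TM's JVM and browse the registered metric beans.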

But in general, I think you are on the right track. You need to reduce the metrics that are sent to DD by configuring the scope / excluding variables.

Furthermore, I think it would be a good idea to make the timeout configurable. Could you open a ticket for that?

Best,

Arvid


Re: DataDog and Flink

Vishal Santoshi
Yes, I will do that. 

Regarding the metrics dump through REST, it does cover the TM-specific metrics but not all the jobs and vertices/operators etc. Moreover, I am not sure I can readily get the vertex ids from the UI.

curl http://[jm]/taskmanagers/[tm_id]
curl http://[jm]/taskmanagers/[tm_id]/metrics 
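By default the /metrics endpoint only lists metric ids; values can be pulled with the get query parameter, e.g. (a sketch, using two standard TM metric names):

  curl 'http://[jm]/taskmanagers/[tm_id]/metrics?get=Status.JVM.Memory.Heap.Used,Status.JVM.CPU.Load'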




Re: DataDog and Flink

Vishal Santoshi
Yep, there is no single endpoint that dumps everything, but something like this works (dirty, but who cares :)). The vertex metrics are the most numerous anyway:
curl -s http://xxxx/jobs/[job_id] | jq -r '.vertices' | jq '.[].id' | xargs -I {} curl http://xxxxxx/jobs/[job_id]/vertices/{}/metrics | jq
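To see which vertex contributes the most metric names (and hence the most payload), a rough per-vertex count could be done like this (a sketch, assuming the same placeholder host and job id, and that jq is available):

  # count the metric ids reported for each vertex of the job
  for v in $(curl -s http://xxxx/jobs/[job_id] | jq -r '.vertices[].id'); do
    n=$(curl -s http://xxxx/jobs/[job_id]/vertices/$v/metrics | jq 'length')
    echo "$v  $n"
  done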
