Flink Graphite Reporter stops reporting via TCP if network issue

Bruno Aranda
Hi,

We are using the Graphite reporter from Flink 1.2.0 to send the metrics via TCP. Due to our network configuration we cannot use UDP at the moment.

We have observed that if there is any problem with Graphite or the network (basically, the TCP connection times out or something similar), the metrics reporter does not recover. This is easy to reproduce by using iptables to block the port we send the metrics to. If we block the port for more than a minute or so, the problem occurs. After the port is reopened, Flink does not resume reporting.

Is this a known issue? Googling shows some problems with the metrics-graphite package that should have been solved already. We have tried updating metrics-core/graphite to the latest version with no success.

Any ideas?

Thanks!

Bruno

Re: Flink Graphite Reporter stops reporting via TCP if network issue

Chesnay Schepler
Hello,

for Graphite, Flink uses the DropWizard metrics reporter. I don't know
at the moment whether it supports any kind of reconnecting functionality.

I'm not sure whether I understood you correctly; did you try upgrading
the DropWizard metrics-core/metrics-graphite dependencies?

If that didn't do the trick we could in fact implement this in Flink, though
it would be a hack. When an error occurs we can simply re-instantiate
the reporter, but we would have to know how the reporter communicates
the connection drop; i.e. whether it throws some exception or not.
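
The re-instantiation idea above could look roughly like the following Java sketch. It is not Flink or DropWizard code: the `Reporter` interface and the `ReconnectingReporter` wrapper are hypothetical names, and the sketch assumes the underlying reporter signals a dropped connection by throwing an exception, which is exactly the open question here.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class ReconnectingReporterSketch {

    /** Hypothetical stand-in for a reporter that holds a TCP connection. */
    interface Reporter extends AutoCloseable {
        void report(String name, long value) throws IOException;

        @Override
        default void close() {
            // Nothing to release in this sketch.
        }
    }

    /** Wraps a reporter and builds a fresh instance after a failed report. */
    static final class ReconnectingReporter {
        private final Supplier<Reporter> factory;
        private Reporter current;

        ReconnectingReporter(Supplier<Reporter> factory) {
            this.factory = factory;
        }

        /** Returns true if the metric was sent, false if it was dropped. */
        boolean report(String name, long value) {
            try {
                if (current == null) {
                    current = factory.get(); // (re-)instantiate lazily
                }
                current.report(name, value);
                return true;
            } catch (IOException | RuntimeException e) {
                // Assumption: a broken connection surfaces as an exception.
                // Discard the instance so the next call creates a new one.
                if (current != null) {
                    current.close();
                    current = null;
                }
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        // Simulated factory: the first instance is broken, later ones work.
        Supplier<Reporter> factory = () -> {
            int id = created.incrementAndGet();
            return (name, value) -> {
                if (id == 1) {
                    throw new IOException("simulated broken connection");
                }
                System.out.println("instance " + id + " sent " + name + "=" + value);
            };
        };

        ReconnectingReporter reporter = new ReconnectingReporter(factory);
        System.out.println(reporter.report("numRecordsIn", 1)); // dropped
        System.out.println(reporter.report("numRecordsIn", 2)); // sent by a new instance
    }
}
```

If the real reporter swallows the error and only logs it, a wrapper like this would never see the failure, and a different trigger would be needed.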

Could you check the log for any warning statements from the MetricRegistry?

Regards,
Chesnay

On 05.05.2017 13:26, Bruno Aranda wrote:

> Hi,
>
> We are using the Graphite reporter from Flink 1.2.0 to send the
> metrics via TCP. Due to our network configuration we cannot use UDP at
> the moment.
>
> We have observed that if there is any problem with Graphite or the
> network (basically, the TCP connection times out or something similar),
> the metrics reporter does not recover. This is easy to reproduce by
> using iptables to block the port we send the metrics to. If we block
> the port for more than a minute or so, the problem occurs. After the
> port is reopened, Flink does not resume reporting.
>
> Is this a known issue? Googling shows some problems with the
> metrics-graphite package that should have been solved already. We have
> tried updating metrics-core/graphite to the latest version with no success.
>
> Any ideas?
>
> Thanks!
>
> Bruno



Re: Flink Graphite Reporter stops reporting via TCP if network issue

elmosca
Hi Chesnay,

Many thanks for your reply. In the end, we decided to change the infrastructure a bit and use StatsD instead. This way, we don't need a custom reporter and it works fine.
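
As a rough illustration of why this setup is less sensitive to connection drops: StatsD metrics are typically sent as individual UDP datagrams, so the sender keeps no connection that could go stale. The Java sketch below is not Flink's StatsD reporter; the class name, host, port, and metric name are placeholders, and it assumes a StatsD daemon listening on the default UDP port 8125.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class StatsdGaugeSketch {

    /** Formats a gauge in the StatsD line format: name:value|g */
    static String formatGauge(String name, long value) {
        return name + ":" + value + "|g";
    }

    /** Sends one gauge as a single UDP datagram; no connection state is kept. */
    static void sendGauge(String host, int port, String name, long value) throws IOException {
        byte[] payload = formatGauge(name, value).getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress address = InetAddress.getByName(host);
            socket.send(new DatagramPacket(payload, payload.length, address, port));
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder target and metric name, for illustration only.
        sendGauge("localhost", 8125, "flink.example.numRecordsIn", 42);
    }
}
```

The trade-off is that datagrams lost while the network is down are not retried; reporting simply continues with the next datagram once the path is available again.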

Thanks!

Bruno

On Fri, 5 May 2017 at 13:20 Chesnay Schepler <[hidden email]> wrote:
Hello,

for Graphite, Flink uses the DropWizard metrics reporter. I don't know
at the moment whether it supports any kind of reconnecting functionality.

I'm not sure whether I understood you correctly; did you try upgrading
the DropWizard metrics-core/metrics-graphite dependencies?

If that didn't do the trick we could in fact implement this in Flink, though
it would be a hack. When an error occurs we can simply re-instantiate
the reporter, but we would have to know how the reporter communicates
the connection drop; i.e. whether it throws some exception or not.

Could you check the log for any warning statements from the MetricRegistry?

Regards,
Chesnay