Temporary failure in name resolution on JobManager


Temporary failure in name resolution on JobManager

David Maddison
I have a Flink 1.7 cluster using the "flink:1.7.2" (OpenJDK build 1.8.0_222-b10) image on Kubernetes.

As part of a MasterRestoreHook (for checkpointing), the JobManager needs to communicate with an external security service.  This all works well until there's a DNS lookup failure (due to network issues), at which point the JobManager JVM seems unable to ever successfully look up the name again, even after it's confirmed that DNS service has been restored.  The weird thing is that I can use kubectl to exec into the JobManager pod and successfully perform a lookup even while the JobManager JVM is still failing to resolve the name.

Has anybody seen an issue like this before, or have any suggestions?  As far as I'm aware, Flink doesn't install a SecurityManager, so the JVM should only cache failed name lookups for 10 seconds.

Restarting the JobManager JVM does successfully recover the Job, but I'd like to avoid having to do that if possible.

Caused by: java.net.UnknownHostException: <********>.com: Temporary failure in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)

Thanks in advance,

David 

Re: Temporary failure in name resolution on JobManager

Yang Wang
Hi David,

Do you mean that when the JobManager starts, DNS has a problem and the service name cannot be resolved, and then even after DNS recovers the JobManager JVM still cannot look up the name?
I think it may be because of the JVM's DNS cache. You could set the TTLs and give it a try:
sun.net.inetaddr.ttl
sun.net.inetaddr.negative.ttl
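
A minimal sketch of how those might be applied; the 30s/5s values below are arbitrary examples, and env.java.opts is just one place to put the flags, not a recommendation specific to this setup:

// Option 1: pass them as JVM system properties, e.g. via env.java.opts in flink-conf.yaml:
//   env.java.opts: "-Dsun.net.inetaddr.ttl=30 -Dsun.net.inetaddr.negative.ttl=5"
// Option 2: set the equivalent java.security properties before the first lookup:
import java.security.Security;

public class DnsTtlConfig {
    public static void main(String[] args) {
        // Cache successful lookups for 30 seconds.
        Security.setProperty("networkaddress.cache.ttl", "30");
        // Cache failed (negative) lookups for 5 seconds; the JDK default is 10.
        Security.setProperty("networkaddress.cache.negative.ttl", "5");
    }
}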


Best,
Yang


Re: Temporary failure in name resolution on JobManager

David Maddison
Thanks Yang.

We did try both of those properties and it didn't fix it. However, we did EVENTUALLY (after some late nights!) track the issue down, not to DNS resolution but rather to an obscure bug in our own connector code :-(

Thanks for your response,

/David/
