Hi everyone,
I'm currently testing data-local computing in Flink on XtreemFS (I'm one of the developers). We have implemented our adapter using the Hadoop FileSystem interface, and everything works well. However, upon closer inspection I found that only remote splits are assigned, which is strange, as XtreemFS stores files split across multiple nodes and reports the hostnames for each split.

Specifically, I'm receiving the warning message issued in:
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/instance/InstanceConnectionInfo.java#L103

The TaskManagers cannot resolve their hostnames from their IPs, so the nodes identify with their IPs (and not their hostnames), while the splits identify with hostnames. The input split assigner therefore cannot match nodes to splits, resulting in (mostly) non-local computing.

I tracked the issue down, and it turns out that the default name lookup mechanism in Java seems to be faulty on my cluster configuration. When passing "env.java.opts: -Dsun.net.spi.nameservice.provider.1=dns,sun" (a non-default nameservice) in flink-conf.yaml, the IP addresses are resolved to hostnames properly.

I know that this is probably not directly related to Flink, but given that you specifically handle the case where hostname resolution is not possible, I was wondering whether you have experienced such cases and, if so, how you overcame the issue. I'm not particularly fond of performing way too many reverse lookups when the normal strategy using files should work as well (note that nslookup <IP-OF-NODE> works as expected, and when strace'ing the command, it does not even connect to the nameserver).

Thanks in advance for your help
Robert

--
My GPG Key ID: 336E2680
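To illustrate the mismatch described above, here is a minimal sketch of a locality check that compares a worker's identifier with the hostnames reported for a split. It is not Flink's actual assigner code: the class name, the `isLocal` helper, and the example hostnames are hypothetical and only show why an IP-identified node never matches hostname-identified splits.

```java
import java.util.Arrays;
import java.util.List;

public class LocalityMatchSketch {

    // Hypothetical helper (not a Flink API): a split counts as local
    // if one of its reported hosts equals the worker's identifier.
    static boolean isLocal(String workerId, List<String> splitHosts) {
        for (String host : splitHosts) {
            if (host.equalsIgnoreCase(workerId)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed example data: the file system reports hostnames per split.
        List<String> splitHosts = Arrays.asList("node1.example.org", "node2.example.org");

        // Worker that could not reverse-resolve its IP: no match, split is remote.
        System.out.println(isLocal("123.123.123.123", splitHosts)); // prints false

        // Worker that identifies with its hostname: match, split is local.
        System.out.println(isLocal("node1.example.org", splitHosts)); // prints true
    }
}
```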
Hey!

Thanks for reporting this. We added the warning after we spoiled some of our own experiments with faulty DNS configurations.

I am not sure what could be done in this case. Do you know why Java's reverse DNS resolution works differently from nslookup in that case?

BTW: There should not be too many reverse name lookups. Each TaskManager does this once, upon startup.

Greetings,
Stephan

On Thu, Jul 9, 2015 at 11:36 AM, Robert Schmidtke <[hidden email]> wrote:
Hi,
I dug deeply into the Java source code, and it comes down to a native call to getByHostAddr, for which I only found C implementations for Windows and Solaris. Frankly, I don't know what's going on on our Linux machines here; deep down there will be a call to getnameinfo, I presume. I could not yet figure out which system calls are made by nslookup and by the getByHostAddr functions, and why they differ at all.

Another strange thing is that only the hostname of the executing node cannot be resolved; for the other nodes it works. When executing InetAddress.getByName("123.123.123.123").getCanonicalHostName() on the machine with IP 123.123.123.123, the canonical hostname turns out to be that exact IP. When executing the exact same code (with the same IP literal) on machine 123.123.123.124, the FQDN is returned properly.

If I dig something up about a faulty DNS configuration on my side, I'll let you know. And yes, you're right, the lookup is only performed once during startup, so that might be a way to go. I'm a little more worried about communication in general, as I'm not sure whether or how often names need to be resolved when executing a job.

Thanks
Robert
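For anyone who wants to reproduce the check described above on their own nodes, the following is a minimal sketch. It relies on the documented behavior that InetAddress.getCanonicalHostName() falls back to the textual IP address when the reverse lookup does not succeed. The class name and the `looksUnresolved` helper are hypothetical; the IP to test is passed as an argument, and the local host address is used if none is given.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ReverseLookupCheck {

    // Hypothetical helper: the lookup is treated as failed when the
    // "canonical name" is nothing more than the textual IP address.
    static boolean looksUnresolved(String canonicalName, String hostAddress) {
        return canonicalName.equals(hostAddress);
    }

    public static void main(String[] args) throws UnknownHostException {
        // Use the IP literal given on the command line, else this machine's address.
        InetAddress address = args.length > 0
                ? InetAddress.getByName(args[0])
                : InetAddress.getLocalHost();

        String ip = address.getHostAddress();
        String canonical = address.getCanonicalHostName();

        if (looksUnresolved(canonical, ip)) {
            System.out.println("Reverse lookup failed for " + ip);
        } else {
            System.out.println(ip + " resolves to " + canonical);
        }
    }
}
```

Running it on each node with the node's own IP, once with and once without the -Dsun.net.spi.nameservice.provider.1=dns,sun option from the first message, should show whether the problem is limited to a node resolving its own address.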