Thank you Edward and Christophe!

2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo <[hidden email]>:

Hi all,

I did some tests based on the PR Christophe mentioned above, and by making a change on the NettyClient to use CanonicalHostName instead of HostNameAddress to identify the server, the SSL validation works!

I created a PR with this change: https://github.com/apache/flink/pull/5789

Regards,
Edward

2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <[hidden email]>:

Hi Till,

I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103

I added the JobManager and TaskManager logs; I hope this helps to resolve the issue.

Regards,
Edward

2018-03-27 17:48 GMT+02:00 Till Rohrmann <[hidden email]>:

Hi Edward,

Could you please file a JIRA issue for this problem? It might be as simple as the TaskManager's network stack using the IP instead of the hostname, as you suggested. But we have to look into this to be sure. The logs of the JobManager and the TaskManagers could also be helpful.

Cheers,
Till

On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <[hidden email]> wrote:

I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030

There was a PR for it at some point, but nothing has been done so far. It seems the current code explicitly uses the IP rather than the hostname for the Netty SSL configuration.

Without that, I'm really wondering how people are reasonably using SSL on a Kubernetes-based Flink cluster, as every time a pod is (re)started it can theoretically take a different IP. Or am I missing something?

--
Christophe

On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <[hidden email]> wrote:

Hi all,

Currently I have a Flink 1.4 cluster running on Kubernetes with an SSL configuration based on https://ci.apache.org/projects/flink/flink-docs-master/op .s/security-ssl.html

However, as the IPs of the nodes are dynamic (due to the nature of Kubernetes), we are using only the DNS names, which we can control using Kubernetes services.
So we add to the Subject Alternative Name (SAN) the flink-jobmanager DNS and also the DNS for the task managers, *.flink-taskmanager-svc (each task manager has a DNS in the form flink-taskmanager-0.flink-taskmanager-svc). Additionally we set the jobmanager.rpc.address property on all the nodes and each task manager sets the taskmanager.host property, all matching the ones on the certificate.

This is working well when using a Job with Parallelism set to 1. The SSL validations are good and the Jobmanager can communicate with the Task manager and vice versa.

But when we set the parallelism to more than 1 we have exceptions on the SSL validation like this:

Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
    at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
    at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
    at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
    at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
    at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
    ... 21 more

From the logs I see the Jobmanager is correctly registering the taskmanagers:

org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.

And also each taskmanager is correctly registered to use the hostname for communication:

org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
...
akka.remote.Remoting - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
...
org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
...
org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)

But even with that, it seems like the taskmanagers are using the IP to communicate between them, and the SSL validation fails.

Do you know if it's possible to make the taskmanagers use the hostname to communicate instead of the IP?

or

Do you have any advice to get the SSL configuration to work in this environment?

Thanks in advance.

Regards,
Edward
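
For readers following the thread, here is a minimal Java sketch of why the identity string handed to the TLS layer matters. It is not Flink's NettyClient code: the class PeerIdentitySketch and its methods identityFor and clientEngine are hypothetical names, the default JDK SSLContext stands in for one built from the configured keystore and truststore, and it assumes hostname verification is switched on via the "HTTPS" endpoint identification algorithm. When the peer identity is an IP literal, the JDK compares it against IP entries in the certificate's SAN (the matchIP frame in the stack trace above); when it is a DNS name, it is compared against the DNS entries instead.

```java
import java.net.InetAddress;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

/** Hypothetical helper for illustration only; not part of Flink. */
final class PeerIdentitySketch {

    /**
     * Picks the string a client would pass to the TLS layer as the peer identity.
     * preferHostName=false yields the IP literal; preferHostName=true yields the
     * canonical host name, which falls back to the IP literal if the reverse
     * lookup fails.
     */
    static String identityFor(InetAddress address, boolean preferHostName) {
        return preferHostName ? address.getCanonicalHostName() : address.getHostAddress();
    }

    /**
     * Creates a client-mode SSLEngine that will verify the server certificate
     * against peerIdentity during the handshake.
     * Assumption: the JDK default SSLContext is used here as a stand-in.
     */
    static SSLEngine clientEngine(String peerIdentity, int port) {
        SSLContext context;
        try {
            context = SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
        SSLEngine engine = context.createSSLEngine(peerIdentity, port);
        engine.setUseClientMode(true);

        // Enable endpoint identification so the certificate's SAN entries
        // are checked against peerIdentity.
        SSLParameters params = engine.getSSLParameters();
        params.setEndpointIdentificationAlgorithm("HTTPS");
        engine.setSSLParameters(params);
        return engine;
    }

    public static void main(String[] args) {
        // The loopback address is used only so the sketch runs anywhere.
        InetAddress peer = InetAddress.getLoopbackAddress();
        System.out.println("IP identity:       " + identityFor(peer, false));
        System.out.println("Hostname identity: " + identityFor(peer, true));

        SSLEngine engine = clientEngine(identityFor(peer, true), 6121);
        System.out.println("Engine verifies:   " + engine.getPeerHost());
    }
}
```

In a setup like the one described, the certificate's SAN lists only DNS names, so an engine created with the IP identity would fail verification, while one created with the host name could match *.flink-taskmanager-svc, provided reverse DNS resolves the pod IP to that name.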