SSL config on Kubernetes - Dynamic IP

Edward Rojas
Hi all, 

Currently I have a Flink 1.4 cluster running on Kubernetes with an SSL configuration based on https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html

However, because the IPs of the nodes are dynamic (by the nature of Kubernetes), we use only DNS names, which we can control through Kubernetes services. So we add to the Subject Alternative Name (SAN) the flink-jobmanager DNS name and also the DNS names for the task managers, *.flink-taskmanager-svc (each task manager has a DNS name of the form flink-taskmanager-0.flink-taskmanager-svc).

Additionally, we set the jobmanager.rpc.address property on all the nodes, and each task manager sets the taskmanager.host property, all matching the names on the certificate.

This works well when running a job with parallelism set to 1. The SSL validation succeeds and the JobManager can communicate with the TaskManager and vice versa.

But when we set the parallelism to more than 1, SSL validation fails with exceptions like this:

Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found 
at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168) 
at sun.security.util.HostnameChecker.match(HostnameChecker.java:94) 
at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455) 
at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436) 
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252) 
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136) 
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601) 
... 21 more 


From the logs I see the Jobmanager is correctly registering the taskmanagers: 

org.apache.flink.runtime.instance.InstanceManager   - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6. 

And also each taskmanager is correctly registered to use the hostname for communication: 

org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication. 
... 
akka.remote.Remoting   - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122] 
... 
org.apache.flink.runtime.io.network.netty.NettyConfig   - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)] 
... 
org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121) 



But even with that, it seems the TaskManagers are using the IP to communicate with each other, and the SSL validation fails.
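
The exception above names an IP address, which suggests that the client handed the TLS layer an IP literal as the peer host, so only IP-type SAN entries were considered. The following is a minimal, simplified sketch of that distinction in plain Java. It is not the JDK's HostnameChecker, and the class and method names (SanMatchSketch, matches, dnsMatches) are hypothetical; it assumes a certificate whose only SAN entries are the DNS names described earlier in this post.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified illustration only: the real JDK check handles more cases (IPv6, CN fallback, etc.).
public class SanMatchSketch {

    // Rough test for an IPv4 literal such as "172.30.247.163".
    static boolean looksLikeIpv4(String host) {
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    // Match one DNS SAN entry against a host name.
    // A leading "*." is treated as a wildcard for exactly the left-most label.
    static boolean dnsMatches(String san, String host) {
        if (!san.startsWith("*.")) {
            return san.equalsIgnoreCase(host);
        }
        int firstDot = host.indexOf('.');
        return firstDot > 0
                && host.substring(firstDot).equalsIgnoreCase(san.substring(1));
    }

    // An IP literal is compared only with IP SAN entries; a name only with DNS SAN entries.
    static boolean matches(String peerHost, List<String> dnsSans, List<String> ipSans) {
        if (looksLikeIpv4(peerHost)) {
            return ipSans.contains(peerHost);
        }
        return dnsSans.stream().anyMatch(san -> dnsMatches(san, peerHost));
    }

    public static void main(String[] args) {
        // Assumed certificate contents: DNS SANs only, no IP SANs.
        List<String> dnsSans = Arrays.asList("flink-jobmanager", "*.flink-taskmanager-svc");
        List<String> ipSans = Collections.emptyList();

        // Peer identified by name: the wildcard entry matches.
        System.out.println(matches("flink-taskmanager-1.flink-taskmanager-svc", dnsSans, ipSans)); // true
        // Peer identified by IP: there is no IP SAN to match against.
        System.out.println(matches("172.30.247.163", dnsSans, ipSans)); // false
    }
}
```

Under these assumptions, the same certificate passes or fails depending only on whether the connecting side identifies the peer by name or by IP.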

Do you know if it's possible to make the TaskManagers use the hostname instead of the IP to communicate?
or
Do you have any advice on getting the SSL configuration to work in this environment?

Thanks in advance. 

Regards, 
Edward

Re: SSL config on Kubernetes - Dynamic IP

Christophe Jolif

I suspect this relates to https://issues.apache.org/jira/browse/FLINK-5030, for which there was a PR at some point, but nothing has been done so far. It seems the current code explicitly uses the IP rather than the hostname for the Netty SSL configuration.

Without that, I really wonder how people can reasonably use SSL on a Kubernetes-based Flink cluster, since every time a pod is (re)started it can theoretically get a different IP. Or am I missing something?

--
Christophe

Re: SSL config on Kubernetes - Dynamic IP

Till Rohrmann
Hi Edward,

Could you please file a JIRA issue for this problem? It might be as simple as the TaskManager's network stack using the IP instead of the hostname, as you suggested, but we have to look into this to be sure. The logs of the JobManager as well as the TaskManagers would also be helpful.

Cheers,
Till


Re: SSL config on Kubernetes - Dynamic IP

Sampath Bhat
Hi Edward,

You can use this parameter in flink-conf.yaml to suppress the hostname checking in certificates, if it suits your purpose:
security.ssl.verify-hostname: false
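
As a rough illustration of what such a switch means at the Java TLS level, the sketch below toggles endpoint identification on a standard SSLParameters object. How Flink wires security.ssl.verify-hostname internally is an assumption here, and the class and method names (HostnameVerificationSketch, withHostnameVerification) are hypothetical; only the javax.net.ssl calls are standard JDK API.

```java
import javax.net.ssl.SSLParameters;

// Sketch only: shows the JSSE knob that a "verify hostname" option would typically control.
public class HostnameVerificationSketch {

    // "HTTPS" asks JSSE to check the peer certificate against the peer host during the handshake;
    // null leaves endpoint identification switched off.
    static SSLParameters withHostnameVerification(SSLParameters params, boolean verifyHostname) {
        params.setEndpointIdentificationAlgorithm(verifyHostname ? "HTTPS" : null);
        return params;
    }

    public static void main(String[] args) {
        SSLParameters verified = withHostnameVerification(new SSLParameters(), true);
        SSLParameters unverified = withHostnameVerification(new SSLParameters(), false);
        System.out.println("verify on:  " + verified.getEndpointIdentificationAlgorithm());
        System.out.println("verify off: " + unverified.getEndpointIdentificationAlgorithm());
    }
}
```

Turning the check off removes the protection that hostname verification provides against a peer presenting some other valid certificate, so it is better treated as a workaround than as a fix.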

Secondly, I'm also running Flink 1.4 on K8s. I used to get the same error stack trace as you mentioned while the blob client was trying to connect to the blob server. This was resolved by creating a certificate with only the JobManager service name as SAN. It's working fine.
But I have not submitted a job with higher parallelism. Since you say the issue appears when the parallelism is higher, I guess that the TaskManagers are not able to communicate among themselves. Make sure you have exposed the TaskManager services correctly; the logs will surely help.

Jolif, you can use a StatefulSet object in K8s to give each pod a stable hostname that survives restarts (the pod IP itself can still change).


Re: SSL config on Kubernetes - Dynamic IP

Edward Rojas
In reply to this post by Till Rohrmann
Hi Till,

I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103

I added the JobManager and TaskManager logs. I hope this helps to resolve the issue.

Regards, 
Edward

--
Edward Alexander Rojas Clavijo

Software Engineer
Hybrid Cloud
IBM France

Re: SSL config on Kubernetes - Dynamic IP

Edward Rojas
Hi all,

I did some tests based on the PR Christophe mentioned above. By changing the NettyClient to use CanonicalHostName instead of HostNameAddress to identify the server, the SSL validation works!

I created a PR with this change: https://github.com/apache/flink/pull/5789
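
The idea can be sketched with standard JDK classes: the peer host string passed when creating the client-side SSLEngine is what the certificate is later checked against. This is an assumption-laden illustration, not the code from the PR; the names ClientEngineSketch, defaultContext and createClientEngine are hypothetical, and it uses getHostString() rather than a canonical-name lookup so that the example does not depend on reverse DNS.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

public class ClientEngineSketch {

    // Convenience wrapper so callers do not have to handle the checked exception.
    static SSLContext defaultContext() {
        try {
            return SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Create a client-mode engine whose peer host is the given string.
    // With "HTTPS" endpoint identification, the handshake checks the server
    // certificate against exactly this peer host.
    static SSLEngine createClientEngine(SSLContext context, String peerHost, int port) {
        SSLEngine engine = context.createSSLEngine(peerHost, port);
        engine.setUseClientMode(true);
        SSLParameters params = engine.getSSLParameters();
        params.setEndpointIdentificationAlgorithm("HTTPS");
        engine.setSSLParameters(params);
        return engine;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Build an address that carries both a name and an IP, without any DNS lookup.
        InetAddress address = InetAddress.getByAddress(
                "flink-taskmanager-1.flink-taskmanager-svc",
                new byte[] {(byte) 172, 30, (byte) 247, (byte) 163});
        InetSocketAddress server = new InetSocketAddress(address, 6121);

        String byIp = server.getAddress().getHostAddress(); // IP literal
        String byName = server.getHostString();             // stored host name, no reverse lookup

        SSLContext context = defaultContext();
        System.out.println(createClientEngine(context, byIp, server.getPort()).getPeerHost());
        System.out.println(createClientEngine(context, byName, server.getPort()).getPeerHost());
    }
}
```

With the IP variant, a certificate that lists only DNS SANs cannot match; with the name variant, the wildcard or exact DNS entry can.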

Regards, 
Edward

--
Edward Alexander Rojas Clavijo

Software Engineer
Hybrid Cloud
IBM France


Re: SSL config on Kubernetes - Dynamic IP

Fabian Hueske-2
Thank you Edward and Christophe!

2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo <[hidden email]>:
Hi all,

I did some tests based on the PR Christophe mentioned above and by making a change on the NettyClient to use CanonicalHostName instead of HostNameAddress to identify the server, the SSL validation works!!

I created a PR with this change: https://github.com/apache/flink/pull/5789

Regards, 
Edward

2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <[hidden email]>:
Hi Till,

I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103

I added the JobManager and TaskManager logs, Hope this helps to resolve the issue.

Regards, 
Edward

2018-03-27 17:48 GMT+02:00 Till Rohrmann <[hidden email]>:
Hi Edward,

could you please file a JIRA issue for this problem. It might be as simple as that the TaskManager's network stack uses the IP instead of the hostname as you suggested. But we have to look into this to be sure. Also the logs of the JobManager as well as the TaskManagers could be helpful.

Cheers,
Till

On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <[hidden email]> wrote:

I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030

For which there was a PR at some point but nothing has been done so far. It seems the current code explicitly uses the IP vs Hostname for Netty SSL configuration.

Without that I'm really wondering how people are reasonably using SSL on a Kubernetes Flink-based cluster as every time a pod is (re-started) it can theoretically take a different IP? Or do I miss something? 

--
Christophe

On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <[hidden email]> wrote:
Hi all, 

Currently I have a Flink 1.4 cluster running on kubernetes and with SSL configuration based on https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html

However, as the IP of the nodes are dynamic (from the nature of kubernetes), we are using only the DNS which we can control using kubernetes services. So we add to the Subject Alternative Name(SAN) the flink-jobmanager DNS and also the DNS for the task managers *.flink-taskmanager-svc (each task manager has a DNS in the form flink-taskmanager-0.flink-taskmanager-svc). 

Additionally we set the jobmanager.rpc.address property on all the nodes and each task manager sets the taskmanager.host property, all matching the ones on the certificate. 

This is working well when using Job with Parallelism set to 1. The SSL validations are good and the Jobmanager can communicate with Task manager and vice versa. 

But when we set the parallelism to more than 1 we have exceptions on the SSL validation like this: 

Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found 
at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168) 
at sun.security.util.HostnameChecker.match(HostnameChecker.java:94) 
at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455) 
at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436) 
at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252) 
at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136) 
at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601) 
... 21 more 


From the logs, I see that the JobManager is correctly registering the TaskManagers:

org.apache.flink.runtime.instance.InstanceManager   - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6. 

Each TaskManager is also correctly set up to use the hostname for communication:

org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication. 
... 
akka.remote.Remoting   - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122] 
... 
org.apache.flink.runtime.io.network.netty.NettyConfig   - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)] 
... 
org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121) 



But even with that, it seems the TaskManagers are using the IP to communicate with each other, so the SSL validation fails.

Do you know if it's possible to make the TaskManagers use the hostname instead of the IP to communicate?
Or
do you have any advice on getting the SSL configuration to work in this environment?

Thanks in advance. 

Regards, 
Edward



--
Christophe




--
Edward Alexander Rojas Clavijo

Software Engineer
Hybrid Cloud
IBM France


Re: SSL config on Kubernetes - Dynamic IP

Christophe Jolif
By the way, Fabian, is there any chance this issue will be looked into, or the PR considered, for 1.5?

--
Christophe

On Wed, Apr 4, 2018 at 2:41 PM, Fabian Hueske <[hidden email]> wrote:
Thank you Edward and Christophe!

2018-03-29 17:55 GMT+02:00 Edward Alexander Rojas Clavijo <[hidden email]>:
Hi all,

I ran some tests based on the PR Christophe mentioned above. After changing the NettyClient to use CanonicalHostName instead of HostNameAddress to identify the server, the SSL validation works!

I created a PR with this change: https://github.com/apache/flink/pull/5789
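
To illustrate why the peer host string matters, here is a minimal sketch. It assumes the client builds its SSLEngine with SSLContext.createSSLEngine(host, port) and turns on HTTPS-style endpoint identification. The class PeerHostSketch and its helper methods are hypothetical names for illustration only and are not Flink code. The sketch uses getHostName() in place of getCanonicalHostName() so that it runs without a reverse DNS lookup; the address and hostname are taken from the logs in this thread.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

public class PeerHostSketch {

    /** Builds a client-mode engine whose expected peer identity is the given host string. */
    public static SSLEngine clientEngine(String peerHost, int port) {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine(peerHost, port);
            engine.setUseClientMode(true);
            // With "HTTPS" identification, JSSE checks the server certificate against
            // the peer host. An IP literal needs a matching IP entry in the SAN,
            // while a hostname is matched against the DNS entries.
            SSLParameters params = engine.getSSLParameters();
            params.setEndpointIdentificationAlgorithm("HTTPS");
            engine.setSSLParameters(params);
            return engine;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** A server address with a fixed name and IP; getByAddress performs no DNS lookup. */
    public static InetSocketAddress sampleServer() {
        try {
            byte[] ip = {(byte) 172, 30, (byte) 247, (byte) 163};
            InetAddress address = InetAddress.getByAddress(
                    "flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local", ip);
            return new InetSocketAddress(address, 6121);
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InetSocketAddress server = sampleServer();
        // Identifying the server by address: the certificate would need an IP SAN.
        String byAddress = server.getAddress().getHostAddress();
        // Identifying the server by name: a DNS SAN on the certificate can match.
        String byName = server.getAddress().getHostName();
        System.out.println(clientEngine(byAddress, server.getPort()).getPeerHost());
        System.out.println(clientEngine(byName, server.getPort()).getPeerHost());
    }
}
```

The first engine carries the IP literal as its peer host, which is the situation behind the "No subject alternative names matching IP address" error, while the second carries the DNS name that the certificate covers.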

Regards, 
Edward

2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <[hidden email]>:
Hi Till,

I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103

I added the JobManager and TaskManager logs. I hope this helps to resolve the issue.

Regards, 
Edward
