Taskmanager SSL fails looking for Subject Alternative IP Address


PACE, JAMES

I have the following SSL configuration for a 3-node HA Flink cluster:

 

#taskmanager.data.ssl.enabled: false
security.ssl.enabled: true
security.ssl.keystore: /opt/app/certificates/server-keystore.jks
security.ssl.keystore-password: <redacted>
security.ssl.key-password: <redacted>
security.ssl.truststore: /opt/app/certificates/cacerts
security.ssl.truststore-password: <redacted>
security.ssl.verify-hostname: true

 

The job we’re running is the sample WordCount.jar. We are running Flink 1.4.0. It’s not the latest version, but I didn’t see anything suggesting that upgrading would solve this issue.

 

If either security.ssl.verify-hostname is set to false or taskmanager.data.ssl.enabled is set to false, everything works fine. 

 

When Flink is run in the above configuration, with SSL fully enabled and security.ssl.verify-hostname: true, the Flink job fails. However, according to the logs, SSL appears to work fine for Akka, the blob service, and the JobManager.

 

The root cause looks to be: java.security.cert.CertificateException: No subject alternative names matching IP address xxx.xxx.xxx.xxx found

I have tried setting taskmanager.hostname to the FQDN of the host, but that did not change anything.

We don’t generate certificates with SAN fields.
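To make the failing check concrete: when the peer is addressed by an IP literal, the JDK's hostname checker (the `HostnameChecker.matchIP` frame in the trace) looks only at iPAddress entries in the certificate's subject alternative names. Below is a minimal sketch of that lookup, not the JDK's actual code. It works on the data shape returned by `X509Certificate.getSubjectAlternativeNames()`; the class name, method name, host name, and IP address are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, for illustration only.
class SanIpCheck {
    // GeneralName type tags used by X509Certificate.getSubjectAlternativeNames()
    static final int ALTNAME_DNS = 2;
    static final int ALTNAME_IP = 7;

    // Returns true if the SAN collection contains an iPAddress entry equal to ip.
    // Each entry is a two-element list: [Integer type, value].
    static boolean hasIpSan(Collection<List<?>> sans, String ip) {
        if (sans == null) {
            return false; // certificate carries no SAN extension at all
        }
        for (List<?> entry : sans) {
            if (entry.size() >= 2
                    && entry.get(0) instanceof Integer
                    && (Integer) entry.get(0) == ALTNAME_IP
                    && ip.equalsIgnoreCase(String.valueOf(entry.get(1)))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up SAN list with only a DNS name, as many host certificates have.
        List<List<?>> sans = new ArrayList<>();
        sans.add(Arrays.asList(ALTNAME_DNS, "tm1.example.com"));
        System.out.println("IP SAN present: " + hasIpSan(sans, "10.0.0.5"));
    }
}
```

In a real check, the collection would come from the server certificate in the keystore rather than from a hand-built list.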

 

Any thoughts would be appreciated.

 

This is the full stack trace:

Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Sending the partition request failed.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Sending the partition request failed.
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClient$1.operationComplete(PartitionRequestClient.java:119)
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClient$1.operationComplete(PartitionRequestClient.java:111)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:567)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
        at org.apache.flink.shaded.netty4.io.netty.channel.PendingWriteQueue.safeFail(PendingWriteQueue.java:252)
        at org.apache.flink.shaded.netty4.io.netty.channel.PendingWriteQueue.removeAndFailAll(PendingWriteQueue.java:112)
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1256)
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1040)
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:934)
        at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:315)
        at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:229)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
        at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
        at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1431)
        at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:535)
        at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:813)
        at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
        at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1114)
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:981)
        ... 13 more
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem
        at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
        at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:304)
        at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
        at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
        at sun.security.ssl.Handshaker$1.run(Handshaker.java:919)
        at sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1369)
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.runDelegatedTasks(SslHandler.java:1148)
        at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1003)
        ... 13 more
Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address xxx.xxx.xxx.xxx found
        at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:167)
        at sun.security.util.HostnameChecker.match(HostnameChecker.java:93)
        at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
        at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
        at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
        at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
        at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1496)
        ... 21 more


Re: Taskmanager SSL fails looking for Subject Alternative IP Address

Stephan Ewen
Thanks for reporting this.

Given that hostname verification seems to be the issue, I would assume that the TaskManager somehow advertises its address in a form that is incompatible with the verification in some setups.
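As an illustration of why the advertised form matters, here is a minimal sketch using plain JSSE. It is an assumption about the general mechanism, not a copy of Flink's code. Hostname verification on an `SSLEngine` is switched on through the endpoint identification algorithm, and the peer host string handed to the engine is what the certificate is later matched against. The class name, address, and port below are hypothetical.

```java
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical sketch, for illustration only.
class VerifyingEngineSketch {
    // Creates a client-side engine that verifies the server certificate
    // against peerHost during the handshake.
    static SSLEngine clientEngine(String peerHost, int peerPort) {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine(peerHost, peerPort);
            engine.setUseClientMode(true);
            SSLParameters params = engine.getSSLParameters();
            // "HTTPS" enables hostname verification in the default trust manager.
            params.setEndpointIdentificationAlgorithm("HTTPS");
            engine.setSSLParameters(params);
            return engine;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("no default SSLContext available", e);
        }
    }

    public static void main(String[] args) {
        // If the peer is handed over as an IP literal, the certificate needs a
        // matching iPAddress SAN; if it is a DNS name, a dNSName SAN or CN is used.
        SSLEngine byIp = clientEngine("10.0.0.5", 6121);
        System.out.println(byIp.getPeerHost() + " verified with "
                + byIp.getSSLParameters().getEndpointIdentificationAlgorithm());
    }
}
```

Under this reading, a data-plane connection opened against an IP address would fail exactly as in the reported trace unless the certificate lists that IP.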

While it would be interesting to dig deeper into why this happens, I think we need to move away from hostname verification for internal communication (RPC, TaskManager Netty, blob server) anyway, for the following reasons:

  - Hostname verification is hard (or pretty much incompatible) between containers in many container environments
  - The verification is mainly useful if you use a certificate in a certification chain with some other trusted root certificates
  - For internal SSL between JM/TM and TM/TM, the recommended method is to generate a single-purpose certificate (may be self-signed) and add a key store and trust store with only that certificate. Given such a "single certificate truststore", hostname verification does not add any additional security (to my understanding).
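One way to check whether a truststore really is a "single certificate truststore" as described in the last point is to count its trusted certificate entries. The following is a rough sketch using only the standard `java.security.KeyStore` API; the class and method names are hypothetical, and the path and password are taken from the command line rather than assumed.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.util.Collections;

// Hypothetical helper, for illustration only.
class TruststoreSizeCheck {
    // Counts entries that are trusted certificates (as opposed to private keys).
    static int trustedCertCount(KeyStore ks) {
        try {
            int count = 0;
            for (String alias : Collections.list(ks.aliases())) {
                if (ks.isCertificateEntry(alias)) {
                    count++;
                }
            }
            return count;
        } catch (KeyStoreException e) {
            throw new IllegalStateException("keystore not initialized", e);
        }
    }

    // Creates an empty in-memory keystore, useful for trying the helper out.
    static KeyStore emptyStore() {
        try {
            KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
            ks.load(null, null);
            return ks;
        } catch (Exception e) {
            throw new IllegalStateException("cannot create empty keystore", e);
        }
    }

    public static void main(String[] args) throws Exception {
        KeyStore ks = emptyStore();
        if (args.length == 2) {
            // args[0]: truststore path, args[1]: truststore password
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                ks.load(in, args[1].toCharArray());
            }
        }
        System.out.println("trusted certificates: " + trustedCertCount(ks));
    }
}
```

A count of exactly one would match the setup recommended above; a large count suggests a general-purpose CA bundle, where hostname verification carries more weight.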

For Flink 1.6, we are also adding transparent mutual authentication for internal communication (RPC, blob server, Netty data plane), which should be an additional level of security. If this is used with dedicated (self-signed) certificates, it should be very secure and not rely on hostname verification.

That said, for external communication (REST calls against JM/Dispatcher/...) clients should use hostname verification, because many users use certificates in a certificate chain for these external endpoints.

Best,
Stephan



On Thu, Jul 12, 2018 at 11:02 PM, PACE, JAMES <[hidden email]> wrote:
