Flink 1.4.0 keytab is unreadable

sanmutongzi
Hi,
I am running Flink in single-job mode on YARN to read data from a Kafka
cluster that is configured for Kerberos. After I upgraded Flink to
1.4.0, the YARN application no longer starts normally and logs the
following error:

Exception in thread "main" java.lang.RuntimeException:
org.apache.flink.configuration.IllegalConfigurationException: Kerberos
login configuration is invalid; keytab is unreadable
        at org.apache.flink.yarn.YarnTaskManagerRunner.runYarnTaskManager(YarnTaskManagerRunner.java:160)
        at org.apache.flink.yarn.YarnTaskManager$.main(YarnTaskManager.scala:65)
        at org.apache.flink.yarn.YarnTaskManager.main(YarnTaskManager.scala)
Caused by: org.apache.flink.configuration.IllegalConfigurationException:
Kerberos login configuration is invalid; keytab is unreadable
        at org.apache.flink.runtime.security.SecurityConfiguration.validate(SecurityConfiguration.java:139)
        at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:90)
        at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:71)
        at org.apache.flink.yarn.YarnTaskManagerRunner.runYarnTaskManager(YarnTaskManagerRunner.java:139)


So I added some logging to the method "SecurityConfiguration.validate()"
and rebuilt the Flink package.

private void validate() {
   if (!StringUtils.isBlank(keytab)) {
      // principal is required
      if (StringUtils.isBlank(principal)) {
         throw new IllegalConfigurationException("Kerberos login configuration is invalid; keytab requires a principal.");
      }

      // check the keytab is readable
      File keytabFile = new File(keytab);

      // added for debugging: report which of the three checks fails
      if (!keytabFile.exists()) {
         throw new IllegalConfigurationException("WTF! keytabFile is not exist ! keytab:" + keytab);
      }

      if (!keytabFile.isFile()) {
         throw new IllegalConfigurationException("WTF! keytabFile is not file ! keytab:" + keytab);
      }

      if (!keytabFile.canRead()) {
         throw new IllegalConfigurationException("WTF! keytabFile is not readable ! keytab:" + keytab);
      }

      // original check
      if (!keytabFile.exists() || !keytabFile.isFile() || !keytabFile.canRead()) {
         throw new IllegalConfigurationException("Kerberos login configuration is invalid; keytab is unreadable");
      }
   }
}
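
The same diagnosis can also be run outside of Flink, for example directly on a NodeManager host. The following is only a minimal sketch using nothing but the JDK; the class name KeytabCheck and the method firstProblem are hypothetical and not part of Flink.

```java
import java.io.File;
import java.util.Optional;

class KeytabCheck {

    /**
     * Runs the same three checks as validate(), in the same order, and
     * returns a description of the first one that fails. An empty result
     * means the keytab exists, is a regular file and is readable.
     */
    static Optional<String> firstProblem(String keytab) {
        File keytabFile = new File(keytab);
        if (!keytabFile.exists()) {
            return Optional.of("does not exist: " + keytab);
        }
        if (!keytabFile.isFile()) {
            return Optional.of("is not a regular file: " + keytab);
        }
        if (!keytabFile.canRead()) {
            return Optional.of("is not readable: " + keytab);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Default file name is the one seen in the container directory in this thread.
        String path = args.length > 0 ? args[0] : "krb5.keytab";
        System.out.println(
            firstProblem(path).map(p -> "keytab " + p).orElse("keytab OK: " + path));
    }
}
```

Run it with the keytab path from the exception message as the only argument, as the same user the YARN container runs as.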

After that, the YARN log shows the following error:
2017-12-15 17:14:36,314 INFO  org.apache.flink.yarn.YarnTaskManagerRunner                   - localKeytabPath: /data1/yarn/nm/usercache/hadoop/appcache/application_1513310528578_0009/container_e05_1513310528578_0009_01_000002/krb5.keytab
2017-12-15 17:14:36,315 INFO  org.apache.flink.yarn.YarnTaskManagerRunner                   - YARN daemon is running as: hadoop Yarn client user obtainer: hadoop
2017-12-15 17:14:36,315 INFO  org.apache.flink.yarn.YarnTaskManagerRunner                   - ResourceID assigned for this container: container_e05_1513310528578_0009_01_000002
2017-12-15 17:14:36,321 ERROR org.apache.flink.yarn.YarnTaskManagerRunner                   - Exception occurred while launching Task Manager
org.apache.flink.configuration.IllegalConfigurationException: WTF! keytabFile is not exist ! keytab:/data1/yarn/nm/usercache/hadoop/appcache/application_1513310528578_0009/container_e05_1513310528578_0009_01_000001/krb5.keytab
        at org.apache.flink.runtime.security.SecurityConfiguration.validate(SecurityConfiguration.java:140)
        at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:90)
        at org.apache.flink.runtime.security.SecurityConfiguration.<init>(SecurityConfiguration.java:71)
        at org.apache.flink.yarn.YarnTaskManagerRunner.runYarnTaskManager(YarnTaskManagerRunner.java:139)
        at org.apache.flink.yarn.YarnTaskManager$.main(YarnTaskManager.scala:65)
        at org.apache.flink.yarn.YarnTaskManager.main(YarnTaskManager.scala)


These logs show that the "keytabFile" value differs from the
"localKeytabPath": the TaskManager runs in container ..._000002, but the
keytab path being validated points into container ..._000001. I searched
the source of "org.apache.flink.yarn.YarnTaskManagerRunner" and found a
difference between 1.3.2 and 1.4.0:

1.3.2

// To support Yarn Secure Integration Test Scenario
File krb5Conf = new File(currDir, Utils.KRB5_FILE_NAME);

if (krb5Conf.exists() && krb5Conf.canRead()) {
   String krb5Path = krb5Conf.getAbsolutePath();
   LOG.info("KRB5 Conf: {}", krb5Path);
   hadoopConfiguration = new org.apache.hadoop.conf.Configuration();
   hadoopConfiguration.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos");
   hadoopConfiguration.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true");
}

// set keytab principal and replace path with the local path of the shipped keytab file in NodeManager
if (localKeytabPath != null && remoteKeytabPrincipal != null) {
   configuration.setString(SecurityOptions.KERBEROS_LOGIN_KEYTAB, localKeytabPath);
   configuration.setString(SecurityOptions.KERBEROS_LOGIN_PRINCIPAL, remoteKeytabPrincipal);
}


1.4.0

// To support Yarn Secure Integration Test Scenario
File krb5Conf = new File(currDir, Utils.KRB5_FILE_NAME);

if (krb5Conf.exists() && krb5Conf.canRead()) {
   String krb5Path = krb5Conf.getAbsolutePath();
   LOG.info("KRB5 Conf: {}", krb5Path);
   org.apache.hadoop.conf.Configuration hadoopConfiguration = new org.apache.hadoop.conf.Configuration();
   hadoopConfiguration.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos");
   hadoopConfiguration.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true");

   // set keytab principal and replace path with the local path of the shipped keytab file in NodeManager
   if (localKeytabPath != null && remoteKeytabPrincipal != null) {
      configuration.setString(SecurityOptions.KERBEROS_LOGIN_KEYTAB, localKeytabPath);
      configuration.setString(SecurityOptions.KERBEROS_LOGIN_PRINCIPAL, remoteKeytabPrincipal);
   }

   sc = new SecurityConfiguration(configuration,
      Collections.singletonList(securityConfig -> new HadoopModule(securityConfig, hadoopConfiguration)));

} else {
   sc = new SecurityConfiguration(configuration);
}



In the previous version, "SecurityOptions.KERBEROS_LOGIN_KEYTAB" is
always set to "localKeytabPath", but in 1.4.0 this happens only if
"krb5Conf.exists() && krb5Conf.canRead()" returns true. In my test
case, it looks like the code only takes the else branch.


Is there something I could do to work around this problem?

Thanks!

Re: Flink 1.4.0 keytab is unreadable

Ufuk Celebi
Hey 杨光,

thanks for looking into this in such detail. Unfortunately, I'm not
sure what the expected behaviour is (whether the change in behaviour
was accidental or on purpose).

Let me pull in Gordon who has worked quite a bit on the Kerberos
related components in Flink.

@Gordon:
1) Do you know what the expected behaviour is here?
2) How can he work around this issue in 1.4?

– Ufuk

On Fri, Dec 15, 2017 at 11:34 AM, 杨光 <[hidden email]> wrote:

Re: Flink 1.4.0 keytab is unreadable

Tzu-Li (Gordon) Tai
Hi 杨光,

Thanks a lot for reporting and looking into this with such detail!
Your observations are correct: the changes from 1.3.2 to 1.4.0 in the YarnTaskManagerRunner caused the local Keytab path in TMs to not be correctly set.

Unfortunately, AFAIK there is no workaround for this in 1.4.0.
Keytabs shipped to TMs live in the working directory of the corresponding YARN container, so the correct local path for the keytab cannot be known upfront.
The only scenario in which this would work is if all TM containers happen to be on the same NodeManager as the AM container.
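
To illustrate why the path cannot be fixed in advance, here is a small sketch that derives the keytab location from a container working directory. It assumes only the JDK; the class ShippedKeytab and its method are hypothetical, and the two container directories are the ones from the log output earlier in this thread (the one ending in _000001 is presumably the AM container).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ShippedKeytab {

    // File name of the shipped keytab as it appears in the logs of this thread.
    static final String KEYTAB_FILE_NAME = "krb5.keytab";

    /** Resolves the keytab location inside one container's working directory. */
    static Path resolve(Path containerWorkDir) {
        return containerWorkDir.resolve(KEYTAB_FILE_NAME);
    }

    public static void main(String[] args) {
        Path appDir = Paths.get(
            "/data1/yarn/nm/usercache/hadoop/appcache/application_1513310528578_0009");

        // Path that ended up in the TM configuration (taken from the exception message).
        Path configured = resolve(appDir.resolve("container_e05_1513310528578_0009_01_000001"));
        // Path where the keytab was actually localized for the TM (the logged localKeytabPath).
        Path local = resolve(appDir.resolve("container_e05_1513310528578_0009_01_000002"));

        System.out.println("configured: " + configured);
        System.out.println("local:      " + local);
        System.out.println("same path:  " + configured.equals(local));
    }
}
```

Each container gets its own directory, so the path has to be computed inside the TM process at startup rather than taken from a configuration written elsewhere.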

@Eron,
This is a recurrence of FLINK-5580 [1], and as you speculated, the TM is using the wrong keytab path again because it was not properly set.
I agree that the integration test scenario is best kept out of the main code. It actually seems to be the cause of the issue this time as well.
As you can see in [2], the change was only meant to refactor the integration test scenario code block, but it accidentally affected the keytab path setting.
At the same time, we’ll need better unit test coverage for this, as apparently this can break very easily.
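
As a rough illustration of the regression and of the kind of check a unit test could make, the sketch below models the configuration as a plain map instead of Flink's Configuration class. This is an assumption-laden model, not the actual fix: the class and method names and the krb5ConfPresent parameter are hypothetical, and the paths in main are made up. Only the two keys are the real option names behind SecurityOptions.KERBEROS_LOGIN_KEYTAB and SecurityOptions.KERBEROS_LOGIN_PRINCIPAL.

```java
import java.util.HashMap;
import java.util.Map;

class KeytabOverrideSketch {

    static final String KEYTAB_KEY = "security.kerberos.login.keytab";
    static final String PRINCIPAL_KEY = "security.kerberos.login.principal";

    /** 1.4.0 ordering: the override is nested inside the krb5.conf branch. */
    static Map<String, String> overrideAsIn140(Map<String, String> shipped,
            boolean krb5ConfPresent, String localKeytabPath, String principal) {
        Map<String, String> conf = new HashMap<>(shipped);
        if (krb5ConfPresent) {
            if (localKeytabPath != null && principal != null) {
                conf.put(KEYTAB_KEY, localKeytabPath);
                conf.put(PRINCIPAL_KEY, principal);
            }
        }
        return conf;
    }

    /** 1.3.2 ordering: the override does not depend on krb5.conf at all. */
    static Map<String, String> overrideAsIn132(Map<String, String> shipped,
            String localKeytabPath, String principal) {
        Map<String, String> conf = new HashMap<>(shipped);
        if (localKeytabPath != null && principal != null) {
            conf.put(KEYTAB_KEY, localKeytabPath);
            conf.put(PRINCIPAL_KEY, principal);
        }
        return conf;
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for the shipped config and the TM-local path.
        Map<String, String> shipped = new HashMap<>();
        shipped.put(KEYTAB_KEY, "/am-container/krb5.keytab");

        String local = "/tm-container/krb5.keytab";
        System.out.println("1.4.0, no krb5.conf: "
            + overrideAsIn140(shipped, false, local, "user@REALM").get(KEYTAB_KEY));
        System.out.println("1.3.2:               "
            + overrideAsIn132(shipped, local, "user@REALM").get(KEYTAB_KEY));
    }
}
```

A test along these lines would assert that the keytab key equals the local path whether or not a krb5.conf file is present in the container.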

I’ve filed a JIRA for this, with the comments so far included: FLINK-8270 [3]
Will suggest this to be a blocker for 1.4.1 / 1.5.0.



On 15 December 2017 at 4:12:24 PM, Tzu-Li (Gordon) Tai ([hidden email]) wrote: