Hello all,

I have a question about Kerberos authentication in a YARN environment for a long-running streaming job. According to the documentation (https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/security-kerberos.html#yarnmesos-mode), Flink's solution is to use a keytab to authenticate within the YARN perimeter. If a keytab is configured, Flink calls UserGroupInformation#loginUserFromKeytab to perform authentication.

The YARN security documentation (https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md#keytabs-for-am-and-containers-distributed-via-yarn) says this should be enough: "Launched containers must themselves log in via UserGroupInformation.loginUserFromKeytab(). UGI handles the login, and schedules a background thread to relogin the user periodically."

In reality, however, the UGI source code shows that no background thread is created: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L1153. The method just creates a javax.security.auth.login.LoginContext and performs the authentication. This appears to be the case on several Hadoop branches (2.7, 2.8, 3.0 and trunk). Consequently, Flink does not create any background thread either: https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/HadoopModule.java#L69. As a result, my job loses its credentials for the ResourceManager and HDFS after some time (12 hours in my case). It looks like UGI's code is not aligned with its documentation and does not relogin periodically.
Do you think patching Flink with a background thread that periodically calls UserGroupInformation#reloginFromKeytab could be a solution?

P.S. We are running Flink as a single job on YARN.
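A minimal sketch of the proposed patch, assuming Java 8+. The class PeriodicRelogin and its ReloginAction interface are hypothetical names, and the relogin call is abstracted behind a lambda so the sketch compiles without hadoop-common. In a real deployment the action would presumably be `() -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()`; in Hadoop 2.x/3.x that method relogs in from the keytab only if the TGT has expired or is close to expiry, so invoking it on a fixed schedule should be cheap.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper that periodically runs a Kerberos relogin action.
 * Assumed real-world action (requires hadoop-common):
 *   () -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()
 */
final class PeriodicRelogin implements AutoCloseable {

    /** A relogin action; UGI methods throw IOException, so allow checked exceptions. */
    @FunctionalInterface
    interface ReloginAction {
        void relogin() throws Exception;
    }

    private final ScheduledExecutorService scheduler;

    PeriodicRelogin(ReloginAction action, long period, TimeUnit unit) {
        // Daemon thread so the relogin task never keeps the JVM alive on its own.
        ThreadFactory daemonFactory = r -> {
            Thread t = new Thread(r, "kerberos-relogin");
            t.setDaemon(true);
            return t;
        };
        this.scheduler = Executors.newSingleThreadScheduledExecutor(daemonFactory);

        // scheduleWithFixedDelay suppresses all later runs once a run throws,
        // so every failure is caught here to keep the schedule alive.
        // The first run happens after one period, i.e. after the initial login.
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                action.relogin();
            } catch (Exception e) {
                System.err.println("Kerberos relogin failed: " + e);
            }
        }, period, period, unit);
    }

    /** Stops the background relogin thread. */
    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

For example, right after the initial keytab login one might create `new PeriodicRelogin(() -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab(), 1, TimeUnit.HOURS)` and close it on shutdown. This only refreshes the login user's Kerberos TGT; the choice of a one-hour period is an arbitrary assumption.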
To my knowledge, the various RPC clients take care of renewal (either reactively or with a renewal thread). Some examples:

So I don't think Flink needs a renewal thread, but the overall situation is complex. Some stack traces and logs may be needed to understand the issue.

Eron

On Thu, Dec 14, 2017 at 8:17 AM, Oleksandr Nitavskyi <[hidden email]> wrote: