Flink long-running streaming job, Keytab authentication

Oleksandr Nitavskyi

Hello all,

 

I have a question about Kerberos authentication in a YARN environment for a long-running streaming job. According to the documentation ( https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/security-kerberos.html#yarnmesos-mode ), Flink’s solution is to use a keytab to authenticate within the YARN perimeter.

 

If a keytab is configured, Flink uses the UserGroupInformation#loginUserFromKeytab method to authenticate. The YARN security documentation ( https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md#keytabs-for-am-and-containers-distributed-via-yarn ) says that this should be enough:

 

Launched containers must themselves log in via UserGroupInformation.loginUserFromKeytab(). UGI handles the login, and schedules a background thread to relogin the user periodically.

 

But if we check the source code of UGI, we can see that no background thread is created: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/UserGroupInformation.java#L1153. The method just creates a javax.security.auth.login.LoginContext and performs the authentication. This appears to be true for several Hadoop branches: 2.7, 2.8, 3.0, and trunk. Flink doesn’t create any background thread either: https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/HadoopModule.java#L69. As a result, my job loses its credentials for the ResourceManager and HDFS after some time (12 hours in my case).

 

It looks like UGI’s code is not aligned with the documentation and doesn’t relogin periodically.

Do you think a patch that adds a background thread which periodically calls UGI#reloginFromKeytab could be a solution?

 

P.S. We are running Flink as a single job on YARN.

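For illustration, the background relogin thread proposed above could be sketched roughly as follows. This is only a sketch under assumptions, not Flink's or Hadoop's actual behavior: the class KeytabReloginScheduler and the ReloginAction interface are hypothetical names, and the relogin call is passed in as a callback so the example compiles without Hadoop on the classpath.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Flink or Hadoop) that periodically re-runs a login action. */
final class KeytabReloginScheduler implements AutoCloseable {

    /**
     * Hypothetical hook for the actual relogin call. With Hadoop on the classpath
     * it could be, for example:
     *   () -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()
     */
    interface ReloginAction {
        void relogin() throws Exception;
    }

    private final ScheduledExecutorService executor;

    KeytabReloginScheduler(ReloginAction action, long period, TimeUnit unit) {
        // Daemon thread, so the scheduler never keeps the JVM alive on its own.
        ThreadFactory daemonFactory = runnable -> {
            Thread thread = new Thread(runnable, "keytab-relogin");
            thread.setDaemon(true);
            return thread;
        };
        this.executor = Executors.newSingleThreadScheduledExecutor(daemonFactory);
        executor.scheduleWithFixedDelay(() -> {
            try {
                action.relogin();
            } catch (Exception e) {
                // An exception escaping the task would cancel all future runs,
                // so it is logged and the next period retries.
                System.err.println("Keytab relogin failed, will retry: " + e);
            }
        }, period, period, unit);
    }

    /** Stops the background thread. */
    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

The period would have to be shorter than the ticket lifetime (12 hours in the case described here). Whether such a thread is needed at all depends on where the credentials are actually lost, so this should be read as a starting point for experimentation rather than a confirmed fix.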
 

 

Re: Flink long-running streaming job, Keytab authentication

Eron Wright
To my knowledge, the various RPC clients take care of renewal (whether reactively or using a renewal thread). Some examples:

So I don't think Flink needs a renewal thread, but the overall situation is complex. Some stack traces and logs may be needed to understand the issue.

Eron

On Thu, Dec 14, 2017 at 8:17 AM, Oleksandr Nitavskyi <[hidden email]> wrote:

[quoted text of the original message above]