Hi,
I have a Kerberos-secured YARN/HBase installation, and I want to export data from a large number (~200) of HBase tables to files on HDFS. I wrote a Flink job that does this exactly the way I want for a single table. For the 200 tables I am facing, I see a few possible approaches:

1) Create a single job that reads the data from all of those tables and writes it to the correct files. I expect that to be a monster that will hog the entire cluster because of the large number of HBase regions.

2) Run a job that does this for a single table, and simply run it in a loop. Essentially I would have a shell script or 'main' that loops over all table names and runs a Flink job for each of them. The downside is that this starts a new Flink topology on YARN for every table, with a startup overhead of something like 30 seconds per table that I would like to avoid.

3) Start a single yarn-session and submit my job into it 200 times. That would solve most of the startup overhead, yet it doesn't work. When I start yarn-session I see these two relevant lines in the output:

2016-07-29 14:58:30,575 INFO org.apache.flink.yarn.Utils - Attempting to obtain Kerberos security token for HBase
2016-07-29 14:58:30,576 INFO org.apache.flink.yarn.Utils - HBase is not available (not packaged with this application): ClassNotFoundException : "org.apache.hadoop.hbase.HBaseConfiguration"

As a consequence, any Flink job I submit cannot access HBase at all.

As an experiment I changed my yarn-session.sh script to include HBase on the classpath. (If you want, I can submit a Jira issue and a pull request.) Now the yarn-session does have HBase available and the job runs as expected. However, two problems remain:

1) The yarn-session is accessible by everyone on the cluster, so anyone can run jobs in it that can access all the data I have access to.

2) The Kerberos token will expire after a while, and (just like with all long-running jobs) I would really like this to be a 'long lived' thing.

As far as I know this is just the tip of the security iceberg, and I would like to know what the correct approach is to solve this.
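For reference, the per-table job is essentially the following. This is a trimmed sketch built on the flink-hbase addon's TableInputFormat; the class name, the (rowkey, value) mapping, and the hdfs:///exports/ output path are simplifications of what I actually run:

    import org.apache.flink.addons.hbase.TableInputFormat;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseTableExport {

        // Simplified source: emits (rowkey, first cell value) per row.
        // The real job maps specific column families and qualifiers.
        public static class SingleTableSource extends TableInputFormat<Tuple2<String, String>> {
            private final String tableName;

            public SingleTableSource(String tableName) {
                this.tableName = tableName;
            }

            @Override
            protected Scan getScanner() {
                return new Scan(); // full scan; restrict columns/ranges as needed
            }

            @Override
            protected String getTableName() {
                return tableName;
            }

            @Override
            protected Tuple2<String, String> mapResultToTuple(Result r) {
                return new Tuple2<>(Bytes.toString(r.getRow()), Bytes.toString(r.value()));
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            for (String table : args) {
                // One job per table name passed on the command line. When run
                // against a yarn-session, each execute() becomes a separate
                // job submitted into that session (approach 3 above).
                DataSet<Tuple2<String, String>> rows = env.createInput(new SingleTableSource(table));
                rows.writeAsCsv("hdfs:///exports/" + table);
                env.execute("export-" + table);
            }
        }
    }

Thanks.

Best regards / Met vriendelijke groeten,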
Niels Basjes
Thank you for the breakdown of the problem. Option (1) or (2) would be the way to go, currently.

The problem that (3) does not support HBase is simply solvable by adding the HBase jars to the lib directory (see the sketch at the end of this mail). In the future, this should be solved by the YARN re-architecture: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

For the renewal of Kerberos tokens for streaming jobs: there is work in progress and a pull request to attach keytabs to a Flink job: https://github.com/apache/flink/pull/2275

The problem that the YARN session is accessible by everyone is a bit more tricky. In the future, this should be solved by these two parts:

- With the YARN re-architecture, sessions are bound to individual users. It should be possible to launch the session out of a single YarnExecutionEnvironment and then submit multiple jobs against it.

- The over-the-wire encryption and authentication should make sure that no other user can send jobs to that session.
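For the HBase jars, something along these lines should do. This is only a sketch; the jar locations and the exact set of jars needed depend on your HBase distribution:

    # Option A: drop the HBase client jars into Flink's lib directory,
    # so they are shipped to the cluster with the YARN session.
    # (Example paths; adjust to your installation.)
    cp /usr/lib/hbase/lib/hbase-client-*.jar \
       /usr/lib/hbase/lib/hbase-common-*.jar \
       /usr/lib/hbase/lib/hbase-protocol-*.jar \
       "$FLINK_HOME/lib/"

    # Option B: extend the classpath before launching the session, so the
    # client that obtains the Kerberos token can see HBaseConfiguration.
    # Flink's launch scripts pick up HADOOP_CLASSPATH.
    export HADOOP_CLASSPATH="$(hbase classpath)"
    ./bin/yarn-session.sh

Greetings,
Stephan

On Mon, Aug 1, 2016 at 9:47 AM, Niels Basjes <[hidden email]> wrote: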
Thanks for the pointers towards the work you are doing here. I'll put up a patch for the jars and such in the next few days.

Niels Basjes

On Mon, Aug 1, 2016 at 11:46 AM, Stephan Ewen <[hidden email]> wrote:
Best regards / Met vriendelijke groeten,
Niels Basjes