Error when accessing secure HDFS with standalone Flink

stefanobaghino
Hello everybody,

My colleagues and I have been running some tests on Flink 1.0.0 in a secure environment (Kerberos). Yesterday we ran several tests on the standalone Flink deployment but couldn't get it to access HDFS. Judging from the error, it looks like Flink is not trying to authenticate itself with Kerberos. The root cause of the error is "org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]". I've put the full logs in this gist. I went through the source code and, judging from what I saw, Hadoop emits this error when a client is not using any authentication method on a secure cluster. Also, in the source code of Flink, it looks like a log message (at INFO level) should be printed when a job runs on a secure cluster, stating that fact.

Here are the steps I followed to set up the environment: I built Flink and put it in the same folder on both nodes of the cluster, adjusted the configs, and assigned its ownership (and write permissions) to a group. Then I ran kinit with a user belonging to that group on both nodes, and finally I ran start-cluster.sh and deployed the job. I tried running the job both as the same user who ran the start-cluster.sh script and as another one (still authenticated with Kerberos on both nodes).

The core-site.xml correctly states that the authentication method is kerberos, and with the hdfs CLI everything runs as expected. Thinking it could be an error tied to permissions on the core-site.xml file, I also added the user running the start-cluster.sh script to the hadoop group, which owns the file. Unfortunately, this yielded the same results.
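
For anyone wanting to repeat that check on each node, here is a minimal sketch that reads the authentication method from core-site.xml. It assumes the standard Hadoop property name hadoop.security.authentication and Hadoop's default of "simple" when the property is absent; the function name and the /etc/hadoop/conf fallback path are only illustrative.

```python
import os
import xml.etree.ElementTree as ET


def read_hadoop_auth_method(conf_dir):
    """Return hadoop.security.authentication from core-site.xml in conf_dir.

    Falls back to "simple" (Hadoop's default) when the file or the
    property is missing.
    """
    path = os.path.join(conf_dir, "core-site.xml")
    if not os.path.isfile(path):
        return "simple"
    root = ET.parse(path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "hadoop.security.authentication":
            return prop.findtext("value", "simple").strip().lower()
    return "simple"


if __name__ == "__main__":
    # The fallback path is an assumption; use the directory your cluster uses.
    conf_dir = os.environ.get("HADOOP_CONF_DIR", "/etc/hadoop/conf")
    print(conf_dir, "->", read_hadoop_auth_method(conf_dir))
```

If this prints "simple" for the directory that the Flink processes actually see, that would be consistent with a client falling back to SIMPLE authentication, although it does not prove that this is the cause here.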

Can you help me troubleshoot this issue? Thank you so much in advance!

--
BR,
Stefano Baghino

Software Engineer @ Radicalbit

Re: Error when accessing secure HDFS with standalone Flink

Maximilian Michels
Hi Stefano,

You have probably seen
https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/config.html#kerberos
?

Currently, all nodes need to be authenticated with Kerberos before
Flink is started (not just the JobManager). Could it be that the
start-cluster.sh script is not actually authenticated with Kerberos
on the nodes it connects to over SSH when it starts the TaskManagers?
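
One way to test this could look like the sketch below. It assumes passwordless SSH to every node (as start-cluster.sh needs anyway) and MIT Kerberos's `klist -s`, which prints nothing and exits with status 0 only when the credentials cache is readable and not expired. The host names and function names are hypothetical.

```python
import subprocess


def klist_command(host):
    """Build the ssh command that asks `host` whether a valid ticket exists."""
    # BatchMode=yes makes ssh fail instead of waiting at a password prompt.
    return ["ssh", "-o", "BatchMode=yes", host, "klist", "-s"]


def hosts_without_ticket(hosts, run=subprocess.call):
    """Return the hosts where the check exits with a non-zero status."""
    return [h for h in hosts if run(klist_command(h)) != 0]


if __name__ == "__main__":
    # Hypothetical host names; replace them with the entries of conf/slaves.
    missing = hosts_without_ticket(["node1", "node2"])
    print("No valid ticket on:", missing or "none")
```

Going through ssh on purpose mirrors how the TaskManagers are launched, so a ticket that is visible in an interactive login but not in the SSH session should show up as a failure. A non-zero status can also mean that ssh itself failed, so a reported host is worth checking by hand.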

Best,
Max



Re: Error when accessing secure HDFS with standalone Flink

stefanobaghino
Hi Max,

thanks for the tips. What we did was run kinit on each node with the same user that then ran the start-cluster.sh script. Right now the LDAP groups are backed by the OS ones, and the user that ran the launch script is part of the flink group, which exists on every node of the cluster and has full access to the flink directory (placed under the same path on every node).

Would this have been enough to kerberize Flink?

Also: once a user runs Flink in secure mode, is every deployed job run as the user that ran the start-cluster.sh script (same behavior as running a YARN session)? Or can users kinit on each node and then submit jobs that will individually run with their own credentials?

Thanks again.

--
BR,
Stefano Baghino

Software Engineer @ Radicalbit

Re: Error when accessing secure HDFS with standalone Flink

Maximilian Michels
Hi Stefano,

The preparations for Kerberos which you described look correct.

Taking a closer look at the exception, it seems like the Hadoop config
or environment variables are not set correctly. The client keeps trying
to authenticate with SIMPLE, but on the remote side only Kerberos is
available. Have you added the Hadoop config dir to the Flink config or,
alternatively, set the HADOOP_CONF_DIR environment variable on the
nodes?
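
As a rough way to verify this on each node, the sketch below lists the Hadoop config directories Flink might be pointed at and checks each for a core-site.xml. It assumes that the Flink config key is fs.hdfs.hadoopconf (as I recall it from the Flink 1.0 configuration docs) and that flink-conf.yaml lives under FLINK_CONF_DIR or ./conf; the function names are made up for illustration.

```python
import os


def hadoop_conf_candidates(env, flink_conf_text):
    """Collect the Hadoop config directories Flink might be pointed at."""
    candidates = {}
    if env.get("HADOOP_CONF_DIR"):
        candidates["HADOOP_CONF_DIR"] = env["HADOOP_CONF_DIR"]
    for line in flink_conf_text.splitlines():
        key, sep, value = line.partition(":")
        # Assumed key name; commented-out lines do not match it.
        if sep and key.strip() == "fs.hdfs.hadoopconf":
            candidates["fs.hdfs.hadoopconf"] = value.strip()
    return candidates


def has_core_site(conf_dir):
    """True if conf_dir contains a core-site.xml file."""
    return os.path.isfile(os.path.join(conf_dir, "core-site.xml"))


if __name__ == "__main__":
    conf_path = os.path.join(os.environ.get("FLINK_CONF_DIR", "conf"), "flink-conf.yaml")
    text = ""
    if os.path.isfile(conf_path):
        with open(conf_path) as f:
            text = f.read()
    found = hadoop_conf_candidates(os.environ, text)
    if not found:
        print("No Hadoop config directory configured")
    for source, path in found.items():
        status = "core-site.xml found" if has_core_site(path) else "core-site.xml MISSING"
        print(source, path, status)
```

Note that a variable exported in an interactive shell is not necessarily visible to processes started over SSH, so it may be worth running this the same way the TaskManagers are started.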

Just like on YARN, in standalone mode every job runs as the user who
started the cluster.

Cheers,
Max


Re: Error when accessing secure HDFS with standalone Flink

stefanobaghino
Hi Max,

thanks for clarifying the job ownership question.

Regarding the security configuration, we did set the HADOOP_CONF_DIR environment variable.
Right now we're testing YARN again. If we go back to standalone and can come up with better information about the failure, I'll write again.

Thank you for taking the time to help me!

--
BR,
Stefano Baghino

Software Engineer @ Radicalbit