Re: Flink job on secure Yarn fails after many hours

Posted by Niels Basjes on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Flink-job-on-secure-Yarn-fails-after-many-hours-tp3856p5608.html

Hi,

In my environment the "proxy" approach didn't work.
With a token expiry of 168 hours (1 week), the job consistently terminates after 173.5 hours (within a margin of 10 seconds).
So far we have not been able to solve this problem.

Our teams now simply assume the job fails once in a while and use an automatic restart (i.e. a shell script with a "while true" loop).
Our best guess at the root cause is https://issues.apache.org/jira/browse/HDFS-9276

If you have a real solution, or a reference to a bug report related to this problem, please share!

Niels Basjes



On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault <[hidden email]> wrote:
Hi Max,

I will try these workarounds.
Thanks

Thomas

________________________________________
From: Maximilian Michels [[hidden email]]
Sent: Tuesday, March 15, 2016 16:51
To: [hidden email]
Cc: Niels Basjes
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Thomas,

Niels (CC) and I found out that you need at least Hadoop version 2.6.1
to properly run Kerberos applications on Hadoop clusters. Earlier
versions have critical bugs in the internal security token handling
that may expire a token even though it is still valid.

That said, Hadoop has another limitation: the maximum internal token
lifetime is one week. To work around this limit, you have two
options:

a) Increase the maximum token lifetime:

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>

In hdfs-site.xml:

<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>


b) Set up the YARN ResourceManager as a proxy user for the HDFS NameNode:

From http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html

"You can work around this by configuring the ResourceManager as a
proxy user for the corresponding HDFS NameNode so that the
ResourceManager can request new tokens when the existing ones are past
their maximum lifetime."
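
As a rough sketch of what option b) involves, the settings described on the Cloudera page are along the following lines. The property names, the "yarn" user name, and the wildcard values are assumptions based on that documentation, not something confirmed in this thread, so check them against your distribution before use.

```xml
<!-- yarn-site.xml: allow the ResourceManager to act as a proxy user.
     Property name assumed from the Cloudera page linked above. -->
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml (on the NameNode): allow the user running the
     ResourceManager, assumed here to be "yarn", to impersonate others.
     The "*" wildcards are placeholders; restrict hosts and groups
     where possible. -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
```

Changes of this kind normally take effect only after the affected daemons are restarted.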

@Niels: Could you comment on what worked best for you?

Best,
Max


On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
<[hidden email]> wrote:
>
> Hello everyone,
>
>
>
> We are now facing the same problem in our Flink applications, launched using YARN.
>
> I just want to know if there is any update on this exception?
>
>
>
> Thanks
>
>
>
> Thomas
>
>
>
> ________________________________
>
> From: [hidden email] [[hidden email]] on behalf of Niels Basjes [[hidden email]]
> Sent: Friday, December 4, 2015 10:40
> To: [hidden email]
> Subject: Re: Flink job on secure Yarn fails after many hours
>
> Hi Maximilian,
>
> I just downloaded the version from your Google Drive and used it to run my test topology that accesses HBase.
> I deliberately started it twice to double the chance to run into this situation.
>
> I'll keep you posted.
>
> Niels
>
>
> On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <[hidden email]> wrote:
>>
>> Hi Niels,
>>
>> Just got back from our CI. The build above would fail with a
>> Checkstyle error. I corrected that. Also I have built the binaries for
>> your Hadoop version 2.6.0.
>>
>> Binaries:
>>
>> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>>
>> Thanks,
>> Max
>>
>> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <0.0.0.0:41281
>> >>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>> >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated,
>> >>>> >> >> > stopping
>> >>>> >> >> > process...
>> >>>> >> >> > 21:30:28,286 INFO
>> >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >>>> >> >> > - Removing web root dir
>> >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >>>> >> >> >
>> >>>> >> >> >
>> >>>> >> >> > --
>> >>>> >> >> > Best regards / Met vriendelijke groeten,
>> >>>> >> >> >
>> >>>> >> >> > Niels Basjes
>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes



--
Best regards / Met vriendelijke groeten,

Niels Basjes