Flink on YARN: delegation token expired prevent job restart


Kien Truong
Hi all, 

We are having an issue where the Flink Application Master is unable to automatically restart the Flink job after its delegation token has expired.

We are using Flink 1.11 with YARN 3.1.1 in single-job yarn-cluster mode. We have also added a valid keytab configuration, and the TaskManagers are able to log in with keytabs correctly. However, it seems the YARN Application Master still uses delegation tokens instead of the keytab.

Any ideas on how to resolve this would be much appreciated.

Thanks
Kien




Re: Flink on YARN: delegation token expired prevent job restart

Yangze Guo
Hi, Kien,

Did you configure both "security.kerberos.login.principal" and
"security.kerberos.login.keytab"? If you only set the keytab,
it will not take effect.
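
A quick way to check this is to scan the configuration text for both keys. The following is only a minimal sketch: the line-based parsing is a simplification rather than Flink's own configuration loader, and the keytab path in the example is hypothetical.

```python
# The two options named above; both must be set for the keytab login to apply.
REQUIRED_KEYS = (
    "security.kerberos.login.keytab",
    "security.kerberos.login.principal",
)

def missing_kerberos_keys(conf_text):
    """Return the required Kerberos login keys that are absent or empty."""
    settings = {}
    for raw in conf_text.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return [key for key in REQUIRED_KEYS if not settings.get(key)]

# Hypothetical configuration in which only the keytab is set.
example = """
security.kerberos.login.keytab: /path/to/flink.keytab
"""
print(missing_kerberos_keys(example))
```

An empty result means both options are present; any key in the returned list still needs to be set.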

Best,
Yangze Guo

On Tue, Nov 17, 2020 at 3:03 PM Kien Truong <[hidden email]> wrote:

>
> Hi all,
>
> We are having an issue where the Flink Application Master is unable to automatically restart the Flink job after its delegation token has expired.
>
> We are using Flink 1.11 with YARN 3.1.1 in single-job yarn-cluster mode. We have also added a valid keytab configuration, and the TaskManagers are able to log in with keytabs correctly. However, it seems the YARN Application Master still uses delegation tokens instead of the keytab.
>
> Any ideas on how to resolve this would be much appreciated.
>
> Thanks
> Kien

Re: Flink on YARN: delegation token expired prevent job restart

Kien Truong
Hi, 

Yes, I did. There are also log entries showing successful keytab logins in both the JobManager and the TaskManager.

I found some YARN docs about token renewal on AM restart:

> Therefore, to survive AM restart after token expiry, your AM has to get the NMs to localize the keytab or make no HDFS accesses until (somehow) a new token has been passed to them from a client.

Maybe Flink accessed HDFS with an expired token before switching to the localized keytab?

Regards,
Kien



On 17 Nov 2020 at 15:14, Yangze Guo <[hidden email]> wrote:

Hi, Kien,


Did you configure both "security.kerberos.login.principal" and
"security.kerberos.login.keytab"? If you only set the keytab,
it will not take effect.

Best,
Yangze Guo


Re: Flink on YARN: delegation token expired prevent job restart

Yangze Guo
Hi,

AFAIK, Flink excludes the HDFS_DELEGATION_TOKEN in the
HadoopModule when the user provides the keytab and principal. I'll
investigate further to figure out whether there is any HDFS access
before the HadoopModule is installed.
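
The filtering described here can be pictured with a small model. This is an illustrative sketch only, not Flink's HadoopModule code: the `Token` tuple, the `tokens_to_keep` helper and the host names are hypothetical stand-ins.

```python
from collections import namedtuple

# Minimal stand-in for a Hadoop token: only its kind and target service.
Token = namedtuple("Token", ["kind", "service"])

HDFS_DELEGATION_TOKEN = "HDFS_DELEGATION_TOKEN"

def tokens_to_keep(tokens, keytab_login):
    """Drop HDFS delegation tokens when a keytab login is available."""
    if not keytab_login:
        # Without a keytab, the delegation tokens are the only credentials.
        return list(tokens)
    return [t for t in tokens if t.kind != HDFS_DELEGATION_TOKEN]

# Hypothetical tokens handed to a container at launch.
container_tokens = [
    Token("HDFS_DELEGATION_TOKEN", "namenode.example:8020"),
    Token("YARN_AM_RM_TOKEN", "resourcemanager.example:8030"),
]
print([t.kind for t in tokens_to_keep(container_tokens, keytab_login=True)])
```

The point of dropping the HDFS token is that HDFS access then relies on the keytab login rather than on a token that may already have expired.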

Best,
Yangze Guo


On Tue, Nov 17, 2020 at 4:36 PM Kien Truong <[hidden email]> wrote:

>
> Hi,
>
> Yes, I did. There are also log entries showing successful keytab logins in both the JobManager and the TaskManager.
>
> I found some YARN docs about token renewal on AM restart:
>
> > Therefore, to survive AM restart after token expiry, your AM has to get the NMs to localize the keytab or make no HDFS accesses until (somehow) a new token has been passed to them from a client.
>
> Maybe Flink accessed HDFS with an expired token before switching to the localized keytab?
>
> Regards,
> Kien

Re: Flink on YARN: delegation token expired prevent job restart

Yangze Guo
Hi,

There is a login operation in
YarnEntrypointUtils.logYarnEnvironmentInformation that does not use the
keytab. One suspicion is that Flink may access HDFS when it tries to
build the PackagedProgram.

Does this issue only happen in application mode? If so, I would cc
@kkloudas.

Best,
Yangze Guo

On Tue, Nov 17, 2020 at 4:52 PM Yangze Guo <[hidden email]> wrote:

>
> Hi,
>
> AFAIK, Flink excludes the HDFS_DELEGATION_TOKEN in the
> HadoopModule when the user provides the keytab and principal. I'll
> investigate further to figure out whether there is any HDFS access
> before the HadoopModule is installed.
>
> Best,
> Yangze Guo

Re: Flink on YARN: delegation token expired prevent job restart

Kien Truong
Hi Yangze,

Thanks for checking.

I'm not using the new application mode but the old single-job yarn-cluster mode.

I'll try to get some more logs tomorrow.

Regards,
Kien

On 17 Nov 2020 at 16:37, Yangze Guo <[hidden email]> wrote:

Hi,


There is a login operation in
YarnEntrypointUtils.logYarnEnvironmentInformation that does not use the
keytab. One suspicion is that Flink may access HDFS when it tries to
build the PackagedProgram.

Does this issue only happen in application mode? If so, I would cc
@kkloudas.

Best,
Yangze Guo


Re: Flink on YARN: delegation token expired prevent job restart

Kien Truong
Hi all,

So I've checked the logs, and it seems that the expired delegation token error was triggered during resource localization.
Maybe there's something wrong with my Hadoop setup; the NMs are supposed to get a valid token from the RM in order to localize resources automatically.

Regards,
Kiên

2020-11-17 10:28:55,972 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: { hdfs://xxxxx:8020/user/xxx/.flink/application_1604481558884_0006/lib/flink-table-blink_2.12-1.11.2.jar, 1604482517793, FILE, null } failed: Got expired delegation token id
org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): Got expired delegation token id
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1444)
        at org.apache.hadoop.ipc.Client.call(Client.java:1354)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1660)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
        at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
        at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
        at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
        at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

2020-11-17 10:28:55,973 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_e99_1604481558884_0006_04_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
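
To see which resources are affected across a larger NodeManager log, the warning above can be matched with a regular expression. This is a rough sketch that assumes each warning sits on a single line in the format shown; the `expired_token_resources` helper is hypothetical and the sample lines are copied from the log above.

```python
import re

# Pattern for the ResourceLocalizationService warning; the resource URI is
# the first field inside the braces.
EXPIRED_TOKEN = re.compile(
    r"\{ (?P<uri>\S+), .*\} failed: Got expired delegation token id"
)

def expired_token_resources(log_lines):
    """Return URIs whose localization failed on an expired delegation token."""
    uris = []
    for line in log_lines:
        match = EXPIRED_TOKEN.search(line)
        if match:
            uris.append(match.group("uri"))
    return uris

sample = [
    "2020-11-17 10:28:55,972 WARN org.apache.hadoop.yarn.server.nodemanager."
    "containermanager.localizer.ResourceLocalizationService: "
    "{ hdfs://xxxxx:8020/user/xxx/.flink/application_1604481558884_0006/lib/"
    "flink-table-blink_2.12-1.11.2.jar, 1604482517793, FILE, null } "
    "failed: Got expired delegation token id",
    "2020-11-17 10:28:55,973 INFO org.apache.hadoop.yarn.server.nodemanager."
    "containermanager.container.ContainerImpl: Container "
    "container_e99_1604481558884_0006_04_000001 transitioned from "
    "LOCALIZING to LOCALIZATION_FAILED",
]
for uri in expired_token_resources(sample):
    print(uri)
```

If every failing URI points into the application's `.flink` staging directory, that supports the reading that the failure happens during NodeManager localization, before any Flink code runs.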


On Tue, Nov 17, 2020 at 5:33 PM Kien Truong <[hidden email]> wrote:
Hi Yangze,

Thanks for checking.

I'm not using the new application mode but the old single-job yarn-cluster mode.

I'll try to get some more logs tomorrow.

Regards,
Kien
