Hi,

We have a Kerberos secured Yarn cluster here and I'm experimenting with Apache Flink on top of that.

A few days ago I started a very simple Flink application (just stream the time as a String into HBase 10 times per second).

I (deliberately) asked our IT-ops guys to give my account a max ticket time of 5 minutes and a max renew time of 10 minutes (yes, ridiculously low timeout values, because I needed to validate this: https://issues.apache.org/jira/browse/FLINK-2977).

This job is started with a keytab file, and after running for 31 hours it suddenly failed with the exception you see below. I had the same job running for almost 400 hours until that failed too (I was too late to check the logfiles, but I suspect the same problem).

So in that time span my tickets have expired and new tickets have been obtained several hundred times.

The main error I see in the process of a ticket expiring and being renewed is this message:

    Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked

Yarn on the cluster is 2.6.0 (HDP 2.6.0.2.2.4.2-2).
Flink is version 0.10.1.

How do I fix this?
Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this?

Niels Basjes

21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:nbasjes (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,861 WARN  org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:nbasjes (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,891 WARN  org.apache.hadoop.io.retry.RetryInvocationHandler - Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate. Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy14.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
    at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at org.apache.hadoop.ipc.Client.call(Client.java:1406)
    at org.apache.hadoop.ipc.Client.call(Client.java:1359)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy13.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 29 more
21:30:27,943 ERROR akka.actor.OneForOneStrategy - Invalid AMRMToken from appattempt_1443166961758_163901_000001
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy14.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
    at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at org.apache.hadoop.ipc.Client.call(Client.java:1406)
    at org.apache.hadoop.ipc.Client.call(Client.java:1359)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy13.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 29 more
21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager - Stopping JobManager akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
21:30:28,088 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Sink: Unnamed (1/1) (db0d95c11c14505827e696eec7efab94) switched from RUNNING to CANCELING
21:30:28,113 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Sink: Unnamed (1/1) (db0d95c11c14505827e696eec7efab94) switched from CANCELING to FAILED
21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:41281
21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager - Actor akka://flink/user/jobmanager#403236912 terminated, stopping process...
21:30:28,286 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Removing web root dir /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd

Best regards / Met vriendelijke groeten,

Niels Basjes
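For readers who want to reproduce Niels's setup: on an MIT Kerberos KDC, the very short ticket lifetimes he describes could be configured roughly like this (this is a sketch, not from the thread; the principal and realm names are placeholders, and KDC/realm policy maximums may override the values):

```shell
# In the kadmin shell; "nbasjes@EXAMPLE.COM" is a placeholder principal.
modprinc -maxlife "5 minutes" -maxrenewlife "10 minutes" nbasjes@EXAMPLE.COM
# Verify the new lifetimes took effect:
getprinc nbasjes@EXAMPLE.COM
```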
Hi Niels,
Sorry to hear you experienced this exception. At first glance, it looks like a bug in Hadoop to me.

> "Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked"

That is nothing to worry about. This is Hadoop's internal retry mechanism, which re-attempts previously failed actions when that is possible. Since the action is not idempotent (it cannot be executed again without risking a change to the execution state) and Hadoop also cannot tell whether it was actually invoked, it won't be retried.

The main issue is this exception:

> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_000001

From the stack trace it is clear that this exception occurs upon requesting container status information from the Resource Manager:

> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)

Are there any more exceptions in the log? Do you have the complete logs available, and could you share them?

Best regards,
Max

On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <[hidden email]> wrote:
> Hi,
>
> We have a Kerberos secured Yarn cluster here and I'm experimenting with
> Apache Flink on top of that.
>
> [remainder of the quoted message and its log output snipped]
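The retry decision Max describes can be sketched in isolation. The following is a simplified illustration only, loosely modelled on Hadoop's RetryInvocationHandler; it is not the actual Hadoop code, and the class and method names are invented for this sketch:

```java
import java.util.concurrent.Callable;

/**
 * Simplified sketch of an idempotency-gated retry decision.
 * A failed call is replayed only when it is idempotent or known not to have
 * reached the server; replaying a non-idempotent call that may already have
 * executed could change remote state a second time.
 */
public class RetrySketch {

    public static boolean shouldRetry(boolean idempotent, boolean mayHaveExecuted,
                                      int attempt, int maxRetries) {
        if (!idempotent && mayHaveExecuted) {
            // This is the case behind the log line:
            // "Not retrying because the invoked method is not idempotent,
            //  and unable to determine whether it was invoked"
            return false;
        }
        return attempt < maxRetries;
    }

    /** Invoke a call, retrying according to shouldRetry. */
    public static <T> T invoke(Callable<T> call, boolean idempotent, int maxRetries)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Once the request has been sent we cannot tell whether the
                // server acted on it, so assume it may have executed.
                if (!shouldRetry(idempotent, true, attempt, maxRetries)) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // An idempotent call recovers after two transient failures...
        final int[] failuresLeft = {2};
        String ok = invoke(() -> {
            if (failuresLeft[0]-- > 0) throw new RuntimeException("transient");
            return "ok";
        }, true, 5);
        System.out.println(ok);

        // ...but a non-idempotent call (like AMRMClient.allocate) fails fast.
        try {
            invoke(() -> { throw new RuntimeException("Invalid AMRMToken"); }, false, 5);
        } catch (RuntimeException e) {
            System.out.println("not retried: " + e.getMessage());
        }
    }
}
```

The point of the gate is exactly what the warning in the log says: `allocate` is not idempotent, and since the client cannot tell whether the request reached the ResourceManager, retrying it would be unsafe.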
Hi,

I posted the entire log, from the first log line at the moment of failure to the very end of the logfile. This is all I have.

As far as I understand it, the Kerberos and keytab mechanism in Hadoop Yarn catches the "Invalid Token" error and then (if a keytab is present) obtains a new Kerberos ticket (or TGT?). Once the new ticket has been obtained, it retries the call that previously failed. To me it seems that this call can fail over the invalid token, yet it cannot be retried.

At this moment I'm thinking it's a bug in Hadoop.

Niels

On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <[hidden email]> wrote:
> Hi Niels,

Best regards / Met vriendelijke groeten,
Niels Basjes
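The relogin-and-retry behaviour Niels describes can be sketched as follows. This is an illustration under stated assumptions, not real Hadoop code: `CredentialSource` is an invented stand-in for keytab relogin (in real Hadoop this would be something like `UserGroupInformation.checkTGTAndReloginFromKeytab()`), and the exception type is invented:

```java
import java.util.concurrent.Callable;

/**
 * Sketch of "catch the invalid-token error, get a fresh ticket from the
 * keytab, retry the failed call". Illustration only; not Hadoop's code.
 */
public class ReloginSketch {

    /** Invented stand-in for keytab-based relogin. */
    interface CredentialSource {
        void reloginFromKeytab();
    }

    /** Invented stand-in for SecretManager$InvalidToken. */
    static class InvalidTokenException extends RuntimeException {
        InvalidTokenException(String msg) { super(msg); }
    }

    /** Run the call; on an invalid token, relogin once and retry. */
    public static <T> T callWithRelogin(Callable<T> call, CredentialSource creds)
            throws Exception {
        try {
            return call.call();
        } catch (InvalidTokenException expired) {
            creds.reloginFromKeytab();  // obtain a fresh ticket/TGT
            return call.call();         // retry the previously failed call
        }
    }

    /** Demo: the first attempt fails with an invalid token, relogin fixes it. */
    public static String demo() {
        final boolean[] valid = {false};
        try {
            return callWithRelogin(() -> {
                if (!valid[0]) throw new InvalidTokenException("Invalid AMRMToken");
                return "allocated";
            }, () -> valid[0] = true);  // "relogin" makes the token valid again
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "allocated"
    }
}
```

The failure in this thread looks like the opposite of this happy path: the AMRMToken went invalid, and the `allocate` call was neither accepted with refreshed credentials nor considered safe to retry.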
Hi Niels,
You mentioned you have the option to update Hadoop and redeploy the job. It would be great if you could do that and let us know how it turns out.

Cheers,
Max

On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <[hidden email]> wrote:
> Hi,
>
> I posted the entire log, from the first log line at the moment of failure
> to the very end of the logfile. This is all I have.
>
> As far as I understand it, the Kerberos and keytab mechanism in Hadoop Yarn
> catches the "Invalid Token" error and then (if a keytab is present) gets a
> new Kerberos ticket (or TGT?). When the new ticket has been obtained it
> retries the call that previously failed. To me it seemed that this call can
> fail over the invalid token yet it cannot be retried.
>
> At this moment I'm thinking a bug in Hadoop.
>
> Niels
>
> [earlier quoted messages and log output snipped]
org.apache.hadoop.ipc.Client.call(Client.java:1406) >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >> > at >> > >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >> > at >> > >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >> > ... 29 more >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >> > - Stopping JobManager >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. >> > 21:30:28,088 INFO >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to CANCELING >> > 21:30:28,113 INFO >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to FAILED >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >> > - Stopped BLOB server at 0.0.0.0:41281 >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >> > - Actor akka://flink/user/jobmanager#403236912 terminated, stopping >> > process... >> > 21:30:28,286 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >> > - Removing web root dir >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >> > >> > >> > -- >> > Best regards / Met vriendelijke groeten, >> > >> > Niels Basjes > > > > > -- > Best regards / Met vriendelijke groeten, > > Niels Basjes |
No, I was just asking.
No upgrade is possible for the next month or two.

This is the busiest week of our year: our shop is doing about 10 orders per second these days.

So they won't upgrade until next January/February.

Niels

--
Best regards / Met vriendelijke groeten,

Niels Basjes
I mentioned that the exception gets thrown when requesting container
status information. We need this to send a heartbeat to YARN, but it is
not very crucial if this fails once for the running job. Possibly, we
could work around this problem by retrying N times in case of an
exception.

Would it be possible for you to deploy a custom Flink 0.10.1 version
we provide and test again?

On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <[hidden email]> wrote:
> Hi Niels,
>
> You mentioned you have the option to update Hadoop and redeploy the
> job. It would be great if you could do that and let us know how it
> turns out.
>
> Cheers,
> Max

On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <[hidden email]> wrote:
> Hi,
>
> I posted the entire log, from the first log line at the moment of
> failure to the very end of the logfile. This is all I have.
>
> As far as I understand it, the Kerberos and keytab mechanism in Hadoop
> YARN catches the "Invalid Token" and then (if a keytab is present)
> obtains a new Kerberos ticket (or TGT). Once the new ticket has been
> obtained, it retries the call that previously failed.
> To me it seems that this call can fail over the invalid token, yet it
> cannot be retried.
>
> At this moment I'm thinking it is a bug in Hadoop.
>
> Niels

On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <[hidden email]> wrote:
> Hi Niels,
>
> Sorry to hear you experienced this exception. At first glance, it
> looks like a bug in Hadoop to me.
>
> > "Not retrying because the invoked method is not idempotent, and
> > unable to determine whether it was invoked"
>
> That is nothing to worry about. This is Hadoop's internal retry
> mechanism, which re-attempts actions that previously failed, if that
> is possible. Since the action is not idempotent (it cannot be executed
> again without risking a change to the state of the execution) and
> Hadoop also doesn't track its execution state, it won't be retried
> again.
>
> The main issue is this exception:
>
> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>
> From the stack trace it is clear that this exception occurs upon
> requesting container status information from the ResourceManager:
>
> > at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>
> Are there any more exceptions in the log? Do you have the complete
> logs available, and could you share them?
>
> Best regards,
> Max
>
> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <[hidden email]> wrote:
> > [original report snipped; quoted in full above]
> >
> > How do I fix this?
> > Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
> > Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this?
> >
> > [full log output snipped; quoted in full above]
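The bounded-retry workaround Max describes above ("retrying N times in case of an exception") could be sketched roughly as follows. This is an illustrative sketch only: the class and method names below are hypothetical, and it is not Flink's actual implementation of the YARN heartbeat.

```java
import java.util.concurrent.Callable;

// Hypothetical bounded-retry helper for a non-critical call (e.g. the
// YARN heartbeat discussed in this thread). NOT Flink's actual code.
final class BoundedRetry {

    /**
     * Runs the action; on an exception, tries again up to maxAttempts
     * total attempts, then rethrows the last failure.
     */
    static <T> T run(Callable<T> action, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // e.g. a transient InvalidToken; retry next loop
            }
        }
        throw last; // every attempt failed; propagate and fail the job
    }
}
```

Since, as Max notes, a single failed heartbeat is not very crucial for the running job, wrapping only that call this way would let the job survive a transient token error instead of terminating the JobManager.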
Sure, just give me the git repo url to build and I'll give it a try.

Niels

--
Best regards / Met vriendelijke groeten,

Niels Basjes
Great. Here is the commit to try out:

https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3

If you already have the Flink repository, check it out using

    git fetch https://github.com/mxm/flink/ f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD

Alternatively, here's a direct download link to the sources with the fix included:

https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip

Thanks a lot,
Max
29 more >> >> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy >> >> >> > - Invalid AMRMToken from appattempt_1443166961758_163901_000001 >> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: >> >> >> > Invalid >> >> >> > AMRMToken from appattempt_1443166961758_163901_000001 >> >> >> > at >> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >> >> >> > Method) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >> >> >> > at >> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) >> >> >> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown >> >> >> > Source) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> >> >> > at java.lang.reflect.Method.invoke(Method.java:606) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) >> >> >> > at com.sun.proxy.$Proxy14.allocate(Unknown Source) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > 
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259) >> >> >> > at >> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >> >> >> > at >> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >> >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >> >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >> >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:487) >> >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >> >> >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> >> >> > at >> >> >> > >> >> >> > 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> >> >> > Caused by: >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >> >> >> > ... 29 more >> >> >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >> >> >> > - Stopping JobManager >> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. 
>> >> >> > 21:30:28,088 INFO >> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to >> >> >> > CANCELING >> >> >> > 21:30:28,113 INFO >> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to >> >> >> > FAILED >> >> >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >> >> >> > - Stopped BLOB server at 0.0.0.0:41281 >> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, >> >> >> > stopping >> >> >> > process... >> >> >> > 21:30:28,286 INFO >> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >> >> >> > - Removing web root dir >> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >> >> >> > >> >> >> > >> >> >> > -- >> >> >> > Best regards / Met vriendelijke groeten, >> >> >> > >> >> >> > Niels Basjes >> >> > >> >> > >> >> > >> >> > >> >> > -- >> >> > Best regards / Met vriendelijke groeten, >> >> > >> >> > Niels Basjes >> > >> > >> > >> > >> > -- >> > Best regards / Met vriendelijke groeten, >> > >> > Niels Basjes > > > > > -- > Best regards / Met vriendelijke groeten, > > Niels Basjes |
Sent from my iPhone

> On Dec 3, 2015, at 1:41 AM, Maximilian Michels <[hidden email]> wrote:
>
> Great. Here is the commit to try out:
> https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3
>
> If you already have the Flink repository, check it out using
>
> git fetch https://github.com/mxm/flink/ f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD
>
> Alternatively, here's a direct download link to the sources with the
> fix included:
> https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip
>
> Thanks a lot,
> Max
>
>> On Wed, Dec 2, 2015 at 5:44 PM, Niels Basjes <[hidden email]> wrote:
>> Sure, just give me the git repo url to build and I'll give it a try.
>>
>> Niels
>>
>>> On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels <[hidden email]> wrote:
>>> I mentioned that the exception gets thrown when requesting container
>>> status information. We need this to send a heartbeat to YARN, but it is
>>> not critical if this fails once for the running job. Possibly, we
>>> could work around this problem by retrying N times in case of an
>>> exception.
>>>
>>> Would it be possible for you to deploy a custom Flink 0.10.1 version
>>> we provide and test again?
>>>
>>>> On Wed, Dec 2, 2015 at 4:03 PM, Niels Basjes <[hidden email]> wrote:
>>>> No, I was just asking.
>>>> No upgrade is possible for the next month or two.
>>>>
>>>> This week is our busiest week of the year ...
>>>> Our shop is doing about 10 orders per second these days ...
>>>>
>>>> So they won't upgrade until next January/February.
>>>>
>>>> Niels
>>>>
>>>>> On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <[hidden email]> wrote:
>>>>> Hi Niels,
>>>>>
>>>>> You mentioned you have the option to update Hadoop and redeploy the
>>>>> job. Would be great if you could do that and let us know how it turns
>>>>> out.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>>> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <[hidden email]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I posted the entire log, from the first log line at the moment of
>>>>>> failure to the very end of the logfile. This is all I have.
>>>>>>
>>>>>> As far as I understand it, the Kerberos and keytab mechanism in
>>>>>> Hadoop YARN catches the "Invalid Token" and then (if a keytab is
>>>>>> available) obtains a new Kerberos ticket (or TGT). Once the new
>>>>>> ticket has been obtained, it retries the call that previously
>>>>>> failed. It seems to me that this call can fail over the invalid
>>>>>> token, yet cannot be retried.
>>>>>>
>>>>>> At this moment I'm suspecting a bug in Hadoop.
>>>>>>
>>>>>> Niels
>>>>>>
>>>>>>> On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <[hidden email]> wrote:
>>>>>>> Hi Niels,
>>>>>>>
>>>>>>> Sorry to hear you experienced this exception. At first glance, it
>>>>>>> looks like a bug in Hadoop to me.
>>>>>>>
>>>>>>>> "Not retrying because the invoked method is not idempotent, and
>>>>>>>> unable to determine whether it was invoked"
>>>>>>>
>>>>>>> That is nothing to worry about. This is Hadoop's internal retry
>>>>>>> mechanism, which re-attempts actions that previously failed, where
>>>>>>> that's possible. Since the action is not idempotent (it cannot be
>>>>>>> executed again without risking a change to the state of the
>>>>>>> execution) and it also doesn't track its execution states, it
>>>>>>> won't be retried again.
>>>>>>>
>>>>>>> The main issue is this exception:
>>>>>>>
>>>>>>>> org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>>>>>>> Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>>>>>
>>>>>>> From the stack trace it is clear that this exception occurs upon
>>>>>>> requesting container status information from the Resource Manager:
>>>>>>>
>>>>>>>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>>>>>>
>>>>>>> Are there any more exceptions in the log? Do you have the complete
>>>>>>> logs available and could you share them?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Max
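The retry behaviour Max describes in the thread above — Hadoop only re-invokes a failed call when the method is idempotent, because a non-idempotent call might already have taken effect on the server — can be sketched in a few lines. This is plain Java for illustration only, not Hadoop's actual `RetryInvocationHandler`; the class and method names here are hypothetical:

```java
import java.util.concurrent.Callable;

/**
 * Illustrative sketch of idempotency-aware retrying, loosely mirroring the
 * logic behind the log message "Not retrying because the invoked method is
 * not idempotent". Not Hadoop code; all names are hypothetical.
 */
public class RetrySketch {

    static <T> T invokeWithRetry(Callable<T> call, boolean idempotent, int maxRetries)
            throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                // After a failure we usually cannot tell whether the server
                // already executed the request. Re-running a non-idempotent
                // method could apply it twice, so we fail fast in that case.
                if (!idempotent || ++attempt > maxRetries) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // An idempotent call that fails twice before succeeding gets retried.
        String ok = invokeWithRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, true, 5);
        System.out.println(ok + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```

In the stack trace, the failing `allocate` call falls into the non-idempotent branch, which is why the invalid AMRMToken surfaces immediately instead of being masked by a retry.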
I forgot you're using Flink 0.10.1. The above commit was for the master branch.
So here's the commit for Flink 0.10.1:
https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd

git fetch https://github.com/mxm/flink/ a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD

Alternatively, here's the direct download link to the sources with the fix included:
https://github.com/mxm/flink/archive/a41f3866f4097586a7b2262093088861b62930cd.zip

Thanks,
Max
>>> >> >> > >>> >> >> > >>> >> >> > Niels Basjes >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > 21:30:27,821 WARN org.apache.hadoop.security.UserGroupInformation >>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE) >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > 21:30:27,861 WARN org.apache.hadoop.ipc.Client >>> >> >> > - Exception encountered while connecting to the server : >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > 21:30:27,861 WARN org.apache.hadoop.security.UserGroupInformation >>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE) >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > 21:30:27,891 WARN >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler >>> >> >> > - Exception while invoking class >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate. 
>>> >> >> > Not retrying because the invoked method is not idempotent, and >>> >> >> > unable >>> >> >> > to >>> >> >> > determine whether it was invoked >>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: >>> >> >> > Invalid >>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >>> >> >> > Method) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >>> >> >> > at >>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) >>> >> >> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown >>> >> >> > Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> >> >> > at java.lang.reflect.Method.invoke(Method.java:606) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) >>> >> >> > at com.sun.proxy.$Proxy14.allocate(Unknown Source) >>> >> >> > at >>> >> >> > 
>>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259) >>> >> >> > at >>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >>> >> >> > at >>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >>> >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >>> >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>> >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>> >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>> >> >> > at 
akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>> >> >> > at >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> >> >> > Caused by: >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >>> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >>> >> >> > ... 
29 more >>> >> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy >>> >> >> > - Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: >>> >> >> > Invalid >>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >>> >> >> > Method) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >>> >> >> > at >>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) >>> >> >> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown >>> >> >> > Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> >> >> > at java.lang.reflect.Method.invoke(Method.java:606) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) >>> >> >> > at com.sun.proxy.$Proxy14.allocate(Unknown Source) >>> >> >> > at >>> >> >> > >>> >> >> > 
>>> >> >> > >>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259) >>> >> >> > at >>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >>> >> >> > at >>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >>> >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >>> >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>> >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>> >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>> >> >> > at 
akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>> >> >> > at >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> >> >> > Caused by: >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >>> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >>> >> >> > ... 29 more >>> >> >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >>> >> >> > - Stopping JobManager >>> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. 
>>> >> >> > 21:30:28,088 INFO >>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to >>> >> >> > CANCELING >>> >> >> > 21:30:28,113 INFO >>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to >>> >> >> > FAILED >>> >> >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >>> >> >> > - Stopped BLOB server at 0.0.0.0:41281 >>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, >>> >> >> > stopping >>> >> >> > process... >>> >> >> > 21:30:28,286 INFO >>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >>> >> >> > - Removing web root dir >>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >>> >> >> > >>> >> >> > >>> >> >> > -- >>> >> >> > Best regards / Met vriendelijke groeten, >>> >> >> > >>> >> >> > Niels Basjes >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > -- >>> >> > Best regards / Met vriendelijke groeten, >>> >> > >>> >> > Niels Basjes >>> > >>> > >>> > >>> > >>> > -- >>> > Best regards / Met vriendelijke groeten, >>> > >>> > Niels Basjes >> >> >> >> >> -- >> Best regards / Met vriendelijke groeten, >> >> Niels Basjes |
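Max's suggested workaround above — retrying the YARN heartbeat N times in case of an exception — can be sketched generically. This is only an illustration, not the code in the commits linked in this thread: the `BoundedRetry` class, its parameters, and the simulated heartbeat are all invented for the example, and no Hadoop classes are used.

```java
import java.util.function.Supplier;

public class BoundedRetry {

    // Invoke action up to maxAttempts times, sleeping backoffMillis between
    // attempts; rethrow the last failure if every attempt fails.
    public static <T> T callWithRetries(Supplier<T> action, int maxAttempts, long backoffMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulate a heartbeat that fails twice (e.g. a transient InvalidToken
        // while the ticket is being renewed) and succeeds on the third attempt.
        final int[] calls = {0};
        String result = callWithRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException("Invalid AMRMToken (simulated)");
            }
            return "heartbeat ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "heartbeat ok after 3 attempts"
    }
}
```

Bounding the attempts matters: a token that is genuinely invalid (rather than caught mid-renewal) should still surface as a failure after `maxAttempts` instead of being retried forever.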
Hi Niels,
Just got back from our CI. The build above would fail with a Checkstyle error. I corrected that. I have also built the binaries for your Hadoop version 2.6.0.

Binaries:
https://drive.google.com/file/d/0BziY9U_qva1sZ1FVR3RWeVNrNzA/view?usp=sharing

Source:
https://github.com/mxm/flink/tree/kerberos-yarn-heartbeat-fail-0.10.1

git fetch https://github.com/mxm/flink/ \
    kerberos-yarn-heartbeat-fail-0.10.1 && git checkout FETCH_HEAD

https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip

Thanks,
Max

On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <[hidden email]> wrote:
> I forgot you're using Flink 0.10.1. The above was for the master.
>
> So here's the commit for Flink 0.10.1:
> https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd
>
> git fetch https://github.com/mxm/flink/ \
>     a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD
>
> https://github.com/mxm/flink/archive/a41f3866f4097586a7b2262093088861b62930cd.zip
>
> Thanks,
> Max
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>> >> >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>> >> >> > at >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>> >> >> > Caused by: >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >>>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >>>> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >>>> >> >> > ... 29 more >>>> >> >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >>>> >> >> > - Stopping JobManager >>>> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. 
>>>> >> >> > 21:30:28,088 INFO >>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to >>>> >> >> > CANCELING >>>> >> >> > 21:30:28,113 INFO >>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to >>>> >> >> > FAILED >>>> >> >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >>>> >> >> > - Stopped BLOB server at 0.0.0.0:41281 >>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, >>>> >> >> > stopping >>>> >> >> > process... >>>> >> >> > 21:30:28,286 INFO >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >>>> >> >> > - Removing web root dir >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >>>> >> >> > >>>> >> >> > >>>> >> >> > -- >>>> >> >> > Best regards / Met vriendelijke groeten, >>>> >> >> > >>>> >> >> > Niels Basjes >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > -- >>>> >> > Best regards / Met vriendelijke groeten, >>>> >> > >>>> >> > Niels Basjes >>>> > >>>> > >>>> > >>>> > >>>> > -- >>>> > Best regards / Met vriendelijke groeten, >>>> > >>>> > Niels Basjes >>> >>> >>> >>> >>> -- >>> Best regards / Met vriendelijke groeten, >>> >>> Niels Basjes |
Hi Maximilian,

I just downloaded the version from your Google Drive and used that to run my test topology that accesses HBase.
I deliberately started it twice to double the chance of running into this situation.

I'll keep you posted.

Niels |
Hello everyone,
We are facing the same problem now in our Flink applications, launched using YARN. Just want to know if there is any update about this exception?
Thanks
Thomas
From: [hidden email] [[hidden email]] on behalf of Niels Basjes [[hidden email]]
Sent: Friday, December 4, 2015, 10:40
To: [hidden email]
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Maximilian,

I just downloaded the version from your Google Drive and used that to run my test topology that accesses HBase.
I deliberately started it twice to double the chance of running into this situation.

I'll keep you posted.

Niels |
Hi Thomas,

Niels (CC) and I found out that you need at least Hadoop version 2.6.1 to properly run Kerberos applications on Hadoop clusters. Versions before that have critical bugs in the internal security token handling that may expire a token although it is still valid.

That said, there is another limitation in Hadoop: the maximum internal token lifetime is one week. To work around this limit, you have two options:

a) Increase the maximum token lifetime.

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>

In hdfs-site.xml:

<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>

b) Set up the Yarn ResourceManager as a proxy user for the HDFS NameNode.

From http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html:

"You can work around this by configuring the ResourceManager as a proxy user for the corresponding HDFS NameNode so that the ResourceManager can request new tokens when the existing ones are past their maximum lifetime."

@Niels: Could you comment on what worked best for you?

Best,
Max |
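For option (b), a sketch of the configuration the Cloudera page describes, assuming the ResourceManager runs as user `yarn` (the property names follow the standard Hadoop proxy-user convention; verify them against your Hadoop version):

```xml
<!-- yarn-site.xml: let the RM use proxy-user privileges for token renewal -->
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml (NameNode side): allow the yarn user to impersonate
     job owners when requesting fresh delegation tokens -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
```

In a locked-down cluster you would narrow the `hosts` and `groups` wildcards to the ResourceManager hosts and the affected user groups.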
Hi Max,
I will try these workarounds.
Thanks

Thomas |
Hi,

In my environment doing the "proxy" thing didn't work. With a token expiry of 168 hours (1 week), the job consistently terminates at exactly (within a margin of 10 seconds) 173.5 hours. So far we have not been able to solve this problem.

Our teams now simply assume the thing fails once in a while and have an automatic restart feature (i.e. a shell script with a while-true loop). The best guess at a root cause is https://issues.apache.org/jira/browse/HDFS-9276

If you have a real solution or a reference to a related bug report for this problem, then please share!

Niels Basjes |
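The restart wrapper Niels mentions can be as small as a shell loop. A sketch (the real job-submission command is site-specific; `JOB_CMD` is a placeholder, and the loop is bounded here only so the sketch terminates):

```shell
#!/usr/bin/env bash
# Restart wrapper: resubmit the job whenever it exits with a failure.
# JOB_CMD stands in for the real submission command; 'false' simulates
# a job that always fails. In production the loop would be unbounded.
JOB_CMD=${JOB_CMD:-false}
MAX_RESTARTS=${MAX_RESTARTS:-3}
restarts=0
while [ "$restarts" -lt "$MAX_RESTARTS" ]; do
  if $JOB_CMD; then
    break  # job finished cleanly; nothing to restart
  fi
  restarts=$((restarts + 1))
  echo "job exited; restart #$restarts"
done
echo "total restarts: $restarts"
```

A production version would also add a sleep between attempts so a fast-failing job does not hammer the cluster.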
Hi Niels,
Thanks for the feedback. As far as I know, Hadoop deliberately defaults to the one-week maximum lifetime for delegation tokens. Have you tried increasing the maximum token lifetime, or was that not an option?

I wonder why you use a while loop? Would it be possible to use the Yarn failover mechanism, which starts a new ApplicationMaster and resubmits the job?

Thanks,
Max |
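The Yarn failover Max refers to is governed by how many ApplicationMaster attempts are permitted. A sketch of the relevant settings (property names as documented for YARN and Flink on YARN around that era; verify them against the versions in use):

```xml
<!-- yarn-site.xml: cluster-wide cap on ApplicationMaster attempts -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>
</property>
```

```yaml
# flink-conf.yaml: attempts requested for this Flink session
# (must not exceed the cluster-wide cap above)
yarn.application-attempts: 4
```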
Hi
Has anything ever happened on this issue, and will it be addressed for 1.2? It's a blocker for us.

To quote the YARN security docs: "Any YARN service intended to run for an extended period of time must have a strategy for renewing credentials."
Reference: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html

Spark has this implemented as a thread in its application master that periodically renews delegation tokens with (1) HDFS and (2) YARN:
https://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html |
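The renewal-thread approach described above boils down to running a relogin action on a fixed schedule, comfortably before each expiry. A minimal, Hadoop-free sketch of that scheduling skeleton (the `relogin` Runnable is a placeholder; in a real application master it would perform the keytab-based relogin or token renewal, which is not modeled here):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class RenewalScheduler {

    // Renew well before expiry (here at 80% of the ticket lifetime) so a
    // slow or failed first attempt still leaves time for a retry.
    static long renewalIntervalMillis(long ticketLifetimeMillis) {
        return (long) (ticketLifetimeMillis * 0.8);
    }

    // Periodically run the cluster-specific relogin/renewal action.
    static ScheduledFuture<?> scheduleRenewal(ScheduledExecutorService pool,
                                              Runnable relogin,
                                              long ticketLifetimeMillis) {
        long interval = renewalIntervalMillis(ticketLifetimeMillis);
        return pool.scheduleAtFixedRate(relogin, interval, interval,
                                        TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch renewed = new CountDownLatch(2);
        // Placeholder action: just counts invocations for this demo.
        scheduleRenewal(pool, renewed::countDown, 100L /* toy lifetime, ms */);
        renewed.await();  // wait until the action has fired twice
        pool.shutdownNow();
        System.out.println("renewals=" + (2 - renewed.getCount()));  // prints renewals=2
    }
}
```

The design point is simply that renewal must be driven by the credential's lifetime rather than by job activity, which is why it lives in a dedicated background thread.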
Niels, are you still facing this issue? As far as I understood it, the security changes in Flink 1.2.0 use a new Kerberos mechanism that allows infinite token renewal. |
Hi,

No, this issue is now gone for us. The fixes in 1.2.0 ensured that we are now able to run jobs on our cluster beyond the 7-day limit.

Niels
Best regards / Met vriendelijke groeten,
Niels Basjes |