Hi,

We have a Kerberos secured Yarn cluster here and I'm experimenting with Apache Flink on top of that.

A few days ago I started a very simple Flink application (just stream the time as a String into HBase 10 times per second).

I (deliberately) asked our IT-ops guys to give my account a max ticket time of 5 minutes and a max renew time of 10 minutes (yes, ridiculously low timeout values, because I needed to validate this: https://issues.apache.org/jira/browse/FLINK-2977).

This job is started with a keytab file, and after running for 31 hours it suddenly failed with the exception you see below. I had the same job running for almost 400 hours until that failed too (I was too late to check the logfiles, but I suspect the same problem).

So in that time span my tickets have expired and new tickets have been obtained several hundred times.

The main error I see in the process of a ticket expiring and being renewed is this message:

    Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked

Yarn on the cluster is 2.6.0 (HDP 2.6.0.2.2.4.2-2).
Flink is version 0.10.1.

How do I fix this?
Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this?

Niels Basjes

21:30:27,821 WARN  org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:nbasjes (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,861 WARN  org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,861 WARN  org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:nbasjes (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
21:30:27,891 WARN  org.apache.hadoop.io.retry.RetryInvocationHandler - Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate. Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy14.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
    at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at org.apache.hadoop.ipc.Client.call(Client.java:1406)
    at org.apache.hadoop.ipc.Client.call(Client.java:1359)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy13.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 29 more
21:30:27,943 ERROR akka.actor.OneForOneStrategy - Invalid AMRMToken from appattempt_1443166961758_163901_000001
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy14.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
    at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
    at akka.dispatch.Mailbox.run(Mailbox.scala:221)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_000001
    at org.apache.hadoop.ipc.Client.call(Client.java:1406)
    at org.apache.hadoop.ipc.Client.call(Client.java:1359)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy13.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    ... 29 more
21:30:28,075 INFO  org.apache.flink.yarn.YarnJobManager - Stopping JobManager akka.tcp://flink@10.10.200.3:39527/user/jobmanager.
21:30:28,088 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Sink: Unnamed (1/1) (db0d95c11c14505827e696eec7efab94) switched from RUNNING to CANCELING
21:30:28,113 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Sink: Unnamed (1/1) (db0d95c11c14505827e696eec7efab94) switched from CANCELING to FAILED
21:30:28,184 INFO  org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:41281
21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager - Actor akka://flink/user/jobmanager#403236912 terminated, stopping process...
21:30:28,286 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Removing web root dir /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd

Best regards / Met vriendelijke groeten,

Niels Basjes
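For readers who want to reproduce Niels's setup: on an MIT Kerberos KDC, the very short ticket lifetimes he describes could be configured roughly like this (this is a sketch, not from the thread; the principal and realm names are placeholders, and KDC/realm policy maximums may override the values):

```shell
# In the kadmin shell; "nbasjes@EXAMPLE.COM" is a placeholder principal.
modprinc -maxlife "5 minutes" -maxrenewlife "10 minutes" nbasjes@EXAMPLE.COM
# Verify the new lifetimes took effect:
getprinc nbasjes@EXAMPLE.COM
```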
Hi Niels,
Sorry to hear you experienced this exception. At first glance, it looks like a bug in Hadoop to me.

> "Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked"

That is nothing to worry about. This is Hadoop's internal retry mechanism, which re-attempts previously failed actions when that is possible. Since the action is not idempotent (it cannot be executed again without risking a change to the execution state) and Hadoop also cannot tell whether it was actually invoked, it won't be retried.

The main issue is this exception:

> org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_000001

From the stack trace it is clear that this exception occurs upon requesting container status information from the Resource Manager:

> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)

Are there any more exceptions in the log? Do you have the complete logs available, and could you share them?

Best regards,
Max

On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <[hidden email]> wrote:
> Hi,
>
> We have a Kerberos secured Yarn cluster here and I'm experimenting with
> Apache Flink on top of that.
>
> [remainder of the quoted message and its log output snipped]
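The retry decision Max describes can be sketched in isolation. The following is a simplified illustration only, loosely modelled on Hadoop's RetryInvocationHandler; it is not the actual Hadoop code, and the class and method names are invented for this sketch:

```java
import java.util.concurrent.Callable;

/**
 * Simplified sketch of an idempotency-gated retry decision.
 * A failed call is replayed only when it is idempotent or known not to have
 * reached the server; replaying a non-idempotent call that may already have
 * executed could change remote state a second time.
 */
public class RetrySketch {

    public static boolean shouldRetry(boolean idempotent, boolean mayHaveExecuted,
                                      int attempt, int maxRetries) {
        if (!idempotent && mayHaveExecuted) {
            // This is the case behind the log line:
            // "Not retrying because the invoked method is not idempotent,
            //  and unable to determine whether it was invoked"
            return false;
        }
        return attempt < maxRetries;
    }

    /** Invoke a call, retrying according to shouldRetry. */
    public static <T> T invoke(Callable<T> call, boolean idempotent, int maxRetries)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Once the request has been sent we cannot tell whether the
                // server acted on it, so assume it may have executed.
                if (!shouldRetry(idempotent, true, attempt, maxRetries)) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // An idempotent call recovers after two transient failures...
        final int[] failuresLeft = {2};
        String ok = invoke(() -> {
            if (failuresLeft[0]-- > 0) throw new RuntimeException("transient");
            return "ok";
        }, true, 5);
        System.out.println(ok);

        // ...but a non-idempotent call (like AMRMClient.allocate) fails fast.
        try {
            invoke(() -> { throw new RuntimeException("Invalid AMRMToken"); }, false, 5);
        } catch (RuntimeException e) {
            System.out.println("not retried: " + e.getMessage());
        }
    }
}
```

The point of the gate is exactly what the warning in the log says: `allocate` is not idempotent, and since the client cannot tell whether the request reached the ResourceManager, retrying it would be unsafe.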
Hi,

I posted the entire log, from the first log line at the moment of failure to the very end of the logfile. This is all I have.

As far as I understand it, the Kerberos and keytab mechanism in Hadoop Yarn catches the "Invalid Token" error and then (if a keytab is present) obtains a new Kerberos ticket (or TGT?). Once the new ticket has been obtained, it retries the call that previously failed. To me it seems that this call can fail over the invalid token, yet it cannot be retried.

At this moment I'm thinking it's a bug in Hadoop.

Niels

On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <[hidden email]> wrote:
> Hi Niels,

Best regards / Met vriendelijke groeten,
Niels Basjes
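The relogin-and-retry behaviour Niels describes can be sketched as follows. This is an illustration under stated assumptions, not real Hadoop code: `CredentialSource` is an invented stand-in for keytab relogin (in real Hadoop this would be something like `UserGroupInformation.checkTGTAndReloginFromKeytab()`), and the exception type is invented:

```java
import java.util.concurrent.Callable;

/**
 * Sketch of "catch the invalid-token error, get a fresh ticket from the
 * keytab, retry the failed call". Illustration only; not Hadoop's code.
 */
public class ReloginSketch {

    /** Invented stand-in for keytab-based relogin. */
    interface CredentialSource {
        void reloginFromKeytab();
    }

    /** Invented stand-in for SecretManager$InvalidToken. */
    static class InvalidTokenException extends RuntimeException {
        InvalidTokenException(String msg) { super(msg); }
    }

    /** Run the call; on an invalid token, relogin once and retry. */
    public static <T> T callWithRelogin(Callable<T> call, CredentialSource creds)
            throws Exception {
        try {
            return call.call();
        } catch (InvalidTokenException expired) {
            creds.reloginFromKeytab();  // obtain a fresh ticket/TGT
            return call.call();         // retry the previously failed call
        }
    }

    /** Demo: the first attempt fails with an invalid token, relogin fixes it. */
    public static String demo() {
        final boolean[] valid = {false};
        try {
            return callWithRelogin(() -> {
                if (!valid[0]) throw new InvalidTokenException("Invalid AMRMToken");
                return "allocated";
            }, () -> valid[0] = true);  // "relogin" makes the token valid again
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "allocated"
    }
}
```

The failure in this thread looks like the opposite of this happy path: the AMRMToken went invalid, and the `allocate` call was neither accepted with refreshed credentials nor considered safe to retry.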
Hi Niels,
You mentioned you have the option to update Hadoop and redeploy the job. It would be great if you could do that and let us know how it turns out.

Cheers,
Max

On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <[hidden email]> wrote:
> Hi,
>
> I posted the entire log, from the first log line at the moment of failure
> to the very end of the logfile. This is all I have.
>
> As far as I understand it, the Kerberos and keytab mechanism in Hadoop Yarn
> catches the "Invalid Token" error and then (if a keytab is present) gets a
> new Kerberos ticket (or TGT?). When the new ticket has been obtained it
> retries the call that previously failed. To me it seemed that this call can
> fail over the invalid token yet it cannot be retried.
>
> At this moment I'm thinking a bug in Hadoop.
>
> Niels
>
> [earlier quoted messages and log output snipped]
org.apache.hadoop.ipc.Client.call(Client.java:1406) >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >> > at >> > >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >> > at >> > >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >> > ... 29 more >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >> > - Stopping JobManager >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. >> > 21:30:28,088 INFO >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to CANCELING >> > 21:30:28,113 INFO >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to FAILED >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >> > - Stopped BLOB server at 0.0.0.0:41281 >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >> > - Actor akka://flink/user/jobmanager#403236912 terminated, stopping >> > process... >> > 21:30:28,286 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >> > - Removing web root dir >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >> > >> > >> > -- >> > Best regards / Met vriendelijke groeten, >> > >> > Niels Basjes > > > > > -- > Best regards / Met vriendelijke groeten, > > Niels Basjes |
No, I was just asking.
No upgrade is possible for the next month or two.

This is the busiest week of our year: our shop is doing about 10 orders per second these days.

So they won't upgrade until next January/February.

Niels

--
Best regards / Met vriendelijke groeten,

Niels Basjes
I mentioned that the exception gets thrown when requesting container
status information. We need this to send a heartbeat to YARN, but it is
not very crucial if this fails once for the running job. Possibly, we
could work around this problem by retrying N times in case of an
exception.

Would it be possible for you to deploy a custom Flink 0.10.1 version
we provide and test again?

On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <[hidden email]> wrote:
> Hi Niels,
>
> You mentioned you have the option to update Hadoop and redeploy the
> job. It would be great if you could do that and let us know how it
> turns out.
>
> Cheers,
> Max

On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <[hidden email]> wrote:
> Hi,
>
> I posted the entire log, from the first log line at the moment of
> failure to the very end of the logfile. This is all I have.
>
> As far as I understand it, the Kerberos and keytab mechanism in Hadoop
> YARN catches the "Invalid Token" and then (if a keytab is present)
> obtains a new Kerberos ticket (or TGT). Once the new ticket has been
> obtained, it retries the call that previously failed.
> To me it seems that this call can fail over the invalid token, yet it
> cannot be retried.
>
> At this moment I'm thinking it is a bug in Hadoop.
>
> Niels

On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <[hidden email]> wrote:
> Hi Niels,
>
> Sorry to hear you experienced this exception. At first glance, it
> looks like a bug in Hadoop to me.
>
> > "Not retrying because the invoked method is not idempotent, and
> > unable to determine whether it was invoked"
>
> That is nothing to worry about. This is Hadoop's internal retry
> mechanism, which re-attempts actions that previously failed, if that
> is possible. Since the action is not idempotent (it cannot be executed
> again without risking a change to the state of the execution) and
> Hadoop also doesn't track its execution state, it won't be retried
> again.
>
> The main issue is this exception:
>
> > org.apache.hadoop.security.token.SecretManager$InvalidToken:
> > Invalid AMRMToken from appattempt_1443166961758_163901_000001
>
> From the stack trace it is clear that this exception occurs upon
> requesting container status information from the ResourceManager:
>
> > at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>
> Are there any more exceptions in the log? Do you have the complete
> logs available, and could you share them?
>
> Best regards,
> Max
>
> On Wed, Dec 2, 2015 at 11:47 AM, Niels Basjes <[hidden email]> wrote:
> > [original report snipped; quoted in full above]
> >
> > How do I fix this?
> > Is this a bug (in either Hadoop or Flink) or am I doing something wrong?
> > Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this?
> >
> > [full log output snipped; quoted in full above]
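The bounded-retry workaround Max describes above ("retrying N times in case of an exception") could be sketched roughly as follows. This is an illustrative sketch only: the class and method names below are hypothetical, and it is not Flink's actual implementation of the YARN heartbeat.

```java
import java.util.concurrent.Callable;

// Hypothetical bounded-retry helper for a non-critical call (e.g. the
// YARN heartbeat discussed in this thread). NOT Flink's actual code.
final class BoundedRetry {

    /**
     * Runs the action; on an exception, tries again up to maxAttempts
     * total attempts, then rethrows the last failure.
     */
    static <T> T run(Callable<T> action, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // e.g. a transient InvalidToken; retry next loop
            }
        }
        throw last; // every attempt failed; propagate and fail the job
    }
}
```

Since, as Max notes, a single failed heartbeat is not very crucial for the running job, wrapping only that call this way would let the job survive a transient token error instead of terminating the JobManager.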
Sure, just give me the git repo url to build and I'll give it a try.

Niels

--
Best regards / Met vriendelijke groeten,

Niels Basjes
Great. Here is the commit to try out:

https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3

If you already have the Flink repository, check it out using

    git fetch https://github.com/mxm/flink/ f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD

Alternatively, here's a direct download link to the sources with the fix included:

https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip

Thanks a lot,
Max
29 more >> >> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy >> >> >> > - Invalid AMRMToken from appattempt_1443166961758_163901_000001 >> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: >> >> >> > Invalid >> >> >> > AMRMToken from appattempt_1443166961758_163901_000001 >> >> >> > at >> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >> >> >> > Method) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >> >> >> > at >> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) >> >> >> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown >> >> >> > Source) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> >> >> > at java.lang.reflect.Method.invoke(Method.java:606) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) >> >> >> > at com.sun.proxy.$Proxy14.allocate(Unknown Source) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > 
org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259) >> >> >> > at >> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >> >> >> > at >> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >> >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >> >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >> >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:487) >> >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >> >> >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> >> >> > at >> >> >> > >> >> >> > 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> >> >> > Caused by: >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >> >> >> > at >> >> >> > >> >> >> > >> >> >> > >> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >> >> >> > ... 29 more >> >> >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >> >> >> > - Stopping JobManager >> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. 
>> >> >> > 21:30:28,088 INFO >> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to >> >> >> > CANCELING >> >> >> > 21:30:28,113 INFO >> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to >> >> >> > FAILED >> >> >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >> >> >> > - Stopped BLOB server at 0.0.0.0:41281 >> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, >> >> >> > stopping >> >> >> > process... >> >> >> > 21:30:28,286 INFO >> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >> >> >> > - Removing web root dir >> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >> >> >> > >> >> >> > >> >> >> > -- >> >> >> > Best regards / Met vriendelijke groeten, >> >> >> > >> >> >> > Niels Basjes >> >> > >> >> > >> >> > >> >> > >> >> > -- >> >> > Best regards / Met vriendelijke groeten, >> >> > >> >> > Niels Basjes >> > >> > >> > >> > >> > -- >> > Best regards / Met vriendelijke groeten, >> > >> > Niels Basjes > > > > > -- > Best regards / Met vriendelijke groeten, > > Niels Basjes |
Sent from my iPhone

> On Dec 3, 2015, at 1:41 AM, Maximilian Michels <[hidden email]> wrote:
>
> Great. Here is the commit to try out:
> https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3
>
> If you already have the Flink repository, check it out using
>
> git fetch https://github.com/mxm/flink/ f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD
>
> Alternatively, here's a direct download link to the sources with the
> fix included:
> https://github.com/mxm/flink/archive/f49b9635bec703541f19cb8c615f302a07ea88b3.zip
>
> Thanks a lot,
> Max
>
>> On Wed, Dec 2, 2015 at 5:44 PM, Niels Basjes <[hidden email]> wrote:
>> Sure, just give me the git repo url to build and I'll give it a try.
>>
>> Niels
>>
>>> On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels <[hidden email]> wrote:
>>> I mentioned that the exception gets thrown when requesting container
>>> status information. We need this to send a heartbeat to YARN, but it is
>>> not critical if this fails once for the running job. Possibly, we
>>> could work around this problem by retrying N times in case of an
>>> exception.
>>>
>>> Would it be possible for you to deploy a custom Flink 0.10.1 version
>>> we provide and test again?
>>>
>>>> On Wed, Dec 2, 2015 at 4:03 PM, Niels Basjes <[hidden email]> wrote:
>>>> No, I was just asking.
>>>> No upgrade is possible for the next month or two.
>>>>
>>>> This week is our busiest week of the year ...
>>>> Our shop is doing about 10 orders per second these days ...
>>>>
>>>> So they won't upgrade until next January/February.
>>>>
>>>> Niels
>>>>
>>>>> On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels <[hidden email]> wrote:
>>>>> Hi Niels,
>>>>>
>>>>> You mentioned you have the option to update Hadoop and redeploy the
>>>>> job. Would be great if you could do that and let us know how it turns
>>>>> out.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>>> On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes <[hidden email]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I posted the entire log, from the first log line at the moment of
>>>>>> failure to the very end of the logfile. This is all I have.
>>>>>>
>>>>>> As far as I understand it, the Kerberos and keytab mechanism in
>>>>>> Hadoop YARN catches the "Invalid Token" and then (if a keytab is
>>>>>> available) obtains a new Kerberos ticket (or TGT). Once the new
>>>>>> ticket has been obtained, it retries the call that previously
>>>>>> failed. It seems to me that this call can fail over the invalid
>>>>>> token, yet cannot be retried.
>>>>>>
>>>>>> At this moment I'm suspecting a bug in Hadoop.
>>>>>>
>>>>>> Niels
>>>>>>
>>>>>>> On Wed, Dec 2, 2015 at 2:51 PM, Maximilian Michels <[hidden email]> wrote:
>>>>>>> Hi Niels,
>>>>>>>
>>>>>>> Sorry to hear you experienced this exception. At first glance, it
>>>>>>> looks like a bug in Hadoop to me.
>>>>>>>
>>>>>>>> "Not retrying because the invoked method is not idempotent, and
>>>>>>>> unable to determine whether it was invoked"
>>>>>>>
>>>>>>> That is nothing to worry about. This is Hadoop's internal retry
>>>>>>> mechanism, which re-attempts actions that previously failed, where
>>>>>>> that's possible. Since the action is not idempotent (it cannot be
>>>>>>> executed again without risking a change to the state of the
>>>>>>> execution) and it also doesn't track its execution states, it
>>>>>>> won't be retried again.
>>>>>>>
>>>>>>> The main issue is this exception:
>>>>>>>
>>>>>>>> org.apache.hadoop.security.token.SecretManager$InvalidToken:
>>>>>>>> Invalid AMRMToken from appattempt_1443166961758_163901_000001
>>>>>>>
>>>>>>> From the stack trace it is clear that this exception occurs upon
>>>>>>> requesting container status information from the Resource Manager:
>>>>>>>
>>>>>>>> at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
>>>>>>>
>>>>>>> Are there any more exceptions in the log? Do you have the complete
>>>>>>> logs available and could you share them?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Max
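The retry behaviour Max describes in the thread above — Hadoop only re-invokes a failed call when the method is idempotent, because a non-idempotent call might already have taken effect on the server — can be sketched in a few lines. This is plain Java for illustration only, not Hadoop's actual `RetryInvocationHandler`; the class and method names here are hypothetical:

```java
import java.util.concurrent.Callable;

/**
 * Illustrative sketch of idempotency-aware retrying, loosely mirroring the
 * logic behind the log message "Not retrying because the invoked method is
 * not idempotent". Not Hadoop code; all names are hypothetical.
 */
public class RetrySketch {

    static <T> T invokeWithRetry(Callable<T> call, boolean idempotent, int maxRetries)
            throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return call.call();
            } catch (Exception e) {
                // After a failure we usually cannot tell whether the server
                // already executed the request. Re-running a non-idempotent
                // method could apply it twice, so we fail fast in that case.
                if (!idempotent || ++attempt > maxRetries) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // An idempotent call that fails twice before succeeding gets retried.
        String ok = invokeWithRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, true, 5);
        System.out.println(ok + " after " + calls[0] + " attempts"); // ok after 3 attempts
    }
}
```

In the stack trace, the failing `allocate` call falls into the non-idempotent branch, which is why the invalid AMRMToken surfaces immediately instead of being masked by a retry.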
I forgot you're using Flink 0.10.1. The above commit was for the master branch.
So here's the commit for Flink 0.10.1:
https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd

git fetch https://github.com/mxm/flink/ a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD

Alternatively, here's the direct download link to the sources with the fix included:
https://github.com/mxm/flink/archive/a41f3866f4097586a7b2262093088861b62930cd.zip

Thanks,
Max
>>> >> >> > >>> >> >> > >>> >> >> > Niels Basjes >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > 21:30:27,821 WARN org.apache.hadoop.security.UserGroupInformation >>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE) >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > 21:30:27,861 WARN org.apache.hadoop.ipc.Client >>> >> >> > - Exception encountered while connecting to the server : >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > 21:30:27,861 WARN org.apache.hadoop.security.UserGroupInformation >>> >> >> > - PriviledgedActionException as:nbasjes (auth:SIMPLE) >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > 21:30:27,891 WARN >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler >>> >> >> > - Exception while invoking class >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate. 
>>> >> >> > Not retrying because the invoked method is not idempotent, and >>> >> >> > unable >>> >> >> > to >>> >> >> > determine whether it was invoked >>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: >>> >> >> > Invalid >>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >>> >> >> > Method) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >>> >> >> > at >>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) >>> >> >> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown >>> >> >> > Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> >> >> > at java.lang.reflect.Method.invoke(Method.java:606) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) >>> >> >> > at com.sun.proxy.$Proxy14.allocate(Unknown Source) >>> >> >> > at >>> >> >> > 
>>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259) >>> >> >> > at >>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >>> >> >> > at >>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >>> >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >>> >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>> >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>> >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>> >> >> > at 
akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>> >> >> > at >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> >> >> > Caused by: >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >>> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >>> >> >> > ... 
29 more >>> >> >> > 21:30:27,943 ERROR akka.actor.OneForOneStrategy >>> >> >> > - Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > org.apache.hadoop.security.token.SecretManager$InvalidToken: >>> >> >> > Invalid >>> >> >> > AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native >>> >> >> > Method) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) >>> >> >> > at >>> >> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:526) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79) >>> >> >> > at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown >>> >> >> > Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> >> >> > at java.lang.reflect.Method.invoke(Method.java:606) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) >>> >> >> > at com.sun.proxy.$Proxy14.allocate(Unknown Source) >>> >> >> > at >>> >> >> > >>> >> >> > 
>>> >> >> > >>> >> >> > org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259) >>> >> >> > at >>> >> >> > scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) >>> >> >> > at >>> >> >> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) >>> >> >> > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) >>> >> >> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>> >> >> > at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>> >> >> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>> >> >> > at 
akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>> >> >> > at >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>> >> >> > Caused by: >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >>> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >>> >> >> > at >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >>> >> >> > ... 29 more >>> >> >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >>> >> >> > - Stopping JobManager >>> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. 
>>> >> >> > 21:30:28,088 INFO >>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to >>> >> >> > CANCELING >>> >> >> > 21:30:28,113 INFO >>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to >>> >> >> > FAILED >>> >> >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >>> >> >> > - Stopped BLOB server at 0.0.0.0:41281 >>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, >>> >> >> > stopping >>> >> >> > process... >>> >> >> > 21:30:28,286 INFO >>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >>> >> >> > - Removing web root dir >>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >>> >> >> > >>> >> >> > >>> >> >> > -- >>> >> >> > Best regards / Met vriendelijke groeten, >>> >> >> > >>> >> >> > Niels Basjes >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > -- >>> >> > Best regards / Met vriendelijke groeten, >>> >> > >>> >> > Niels Basjes >>> > >>> > >>> > >>> > >>> > -- >>> > Best regards / Met vriendelijke groeten, >>> > >>> > Niels Basjes >> >> >> >> >> -- >> Best regards / Met vriendelijke groeten, >> >> Niels Basjes |
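Max's suggested workaround above — retrying the YARN heartbeat N times in case of an exception — can be sketched generically. This is only an illustration, not the code in the commits linked in this thread: the `BoundedRetry` class, its parameters, and the simulated heartbeat are all invented for the example, and no Hadoop classes are used.

```java
import java.util.function.Supplier;

public class BoundedRetry {

    // Invoke action up to maxAttempts times, sleeping backoffMillis between
    // attempts; rethrow the last failure if every attempt fails.
    public static <T> T callWithRetries(Supplier<T> action, int maxAttempts, long backoffMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulate a heartbeat that fails twice (e.g. a transient InvalidToken
        // while the ticket is being renewed) and succeeds on the third attempt.
        final int[] calls = {0};
        String result = callWithRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException("Invalid AMRMToken (simulated)");
            }
            return "heartbeat ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "heartbeat ok after 3 attempts"
    }
}
```

Bounding the attempts matters: a token that is genuinely invalid (rather than caught mid-renewal) should still surface as a failure after `maxAttempts` instead of being retried forever.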
Hi Niels,
Just got back from our CI. The build above would fail with a Checkstyle error. I corrected that. I have also built the binaries for your Hadoop version 2.6.0.

Binaries:
https://drive.google.com/file/d/0BziY9U_qva1sZ1FVR3RWeVNrNzA/view?usp=sharing

Source:
https://github.com/mxm/flink/tree/kerberos-yarn-heartbeat-fail-0.10.1

git fetch https://github.com/mxm/flink/ \
    kerberos-yarn-heartbeat-fail-0.10.1 && git checkout FETCH_HEAD

https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip

Thanks,
Max

On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <[hidden email]> wrote:
> I forgot you're using Flink 0.10.1. The above was for the master.
>
> So here's the commit for Flink 0.10.1:
> https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd
>
> git fetch https://github.com/mxm/flink/ \
>     a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD
>
> https://github.com/mxm/flink/archive/a41f3866f4097586a7b2262093088861b62930cd.zip
>
> Thanks,
> Max
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>> >> >> > at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>> >> >> > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>> >> >> > at >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>> >> >> > Caused by: >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): >>>> >> >> > Invalid AMRMToken from appattempt_1443166961758_163901_000001 >>>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1406) >>>> >> >> > at org.apache.hadoop.ipc.Client.call(Client.java:1359) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) >>>> >> >> > at com.sun.proxy.$Proxy13.allocate(Unknown Source) >>>> >> >> > at >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77) >>>> >> >> > ... 29 more >>>> >> >> > 21:30:28,075 INFO org.apache.flink.yarn.YarnJobManager >>>> >> >> > - Stopping JobManager >>>> >> >> > akka.tcp://flink@10.10.200.3:39527/user/jobmanager. 
>>>> >> >> > 21:30:28,088 INFO >>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from RUNNING to >>>> >> >> > CANCELING >>>> >> >> > 21:30:28,113 INFO >>>> >> >> > org.apache.flink.runtime.executiongraph.ExecutionGraph >>>> >> >> > - Source: Custom Source -> Sink: Unnamed (1/1) >>>> >> >> > (db0d95c11c14505827e696eec7efab94) switched from CANCELING to >>>> >> >> > FAILED >>>> >> >> > 21:30:28,184 INFO org.apache.flink.runtime.blob.BlobServer >>>> >> >> > - Stopped BLOB server at 0.0.0.0:41281 >>>> >> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager >>>> >> >> > - Actor akka://flink/user/jobmanager#403236912 terminated, >>>> >> >> > stopping >>>> >> >> > process... >>>> >> >> > 21:30:28,286 INFO >>>> >> >> > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor >>>> >> >> > - Removing web root dir >>>> >> >> > /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd >>>> >> >> > >>>> >> >> > >>>> >> >> > -- >>>> >> >> > Best regards / Met vriendelijke groeten, >>>> >> >> > >>>> >> >> > Niels Basjes >>>> >> > >>>> >> > >>>> >> > >>>> >> > >>>> >> > -- >>>> >> > Best regards / Met vriendelijke groeten, >>>> >> > >>>> >> > Niels Basjes >>>> > >>>> > >>>> > >>>> > >>>> > -- >>>> > Best regards / Met vriendelijke groeten, >>>> > >>>> > Niels Basjes >>> >>> >>> >>> >>> -- >>> Best regards / Met vriendelijke groeten, >>> >>> Niels Basjes |
Hi Maximilian,

I just downloaded the version from your Google Drive and used that to run my test topology that accesses HBase.
I deliberately started it twice to double the chance of running into this situation.

I'll keep you posted.

Niels |
Hello everyone,
We are facing the same problem now in our Flink applications, launched using YARN. Just want to know if there is any update about this exception?
Thanks
Thomas
From: [hidden email] [[hidden email]] on behalf of Niels Basjes [[hidden email]]
Sent: Friday, December 4, 2015, 10:40
To: [hidden email]
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Maximilian,

I just downloaded the version from your Google Drive and used that to run my test topology that accesses HBase.
I deliberately started it twice to double the chance of running into this situation.

I'll keep you posted.

Niels |
Hi Thomas,

Niels (CC) and I found out that you need at least Hadoop version 2.6.1 to properly run Kerberos applications on Hadoop clusters. Versions before that have critical bugs in the internal security token handling that may expire a token although it is still valid.

That said, there is another limitation in Hadoop: the maximum internal token lifetime is one week. To work around this limit, you have two options:

a) Increase the maximum token lifetime.

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>

In hdfs-site.xml:

<property>
  <name>dfs.namenode.delegation.token.max-lifetime</name>
  <value>9223372036854775807</value>
</property>

b) Set up the Yarn ResourceManager as a proxy user for the HDFS NameNode.

From http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html:

"You can work around this by configuring the ResourceManager as a proxy user for the corresponding HDFS NameNode so that the ResourceManager can request new tokens when the existing ones are past their maximum lifetime."

@Niels: Could you comment on what worked best for you?

Best,
Max |
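For option (b), a sketch of the configuration the Cloudera page describes, assuming the ResourceManager runs as user `yarn` (the property names follow the standard Hadoop proxy-user convention; verify them against your Hadoop version):

```xml
<!-- yarn-site.xml: let the RM use proxy-user privileges for token renewal -->
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml (NameNode side): allow the yarn user to impersonate
     job owners when requesting fresh delegation tokens -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
```

In a locked-down cluster you would narrow the `hosts` and `groups` wildcards to the ResourceManager hosts and the affected user groups.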
Hi Max,
I will try these workarounds.
Thanks

Thomas |
Hi,

In my environment doing the "proxy" thing didn't work. With a token expiry of 168 hours (1 week), the job consistently terminates at exactly (within a margin of 10 seconds) 173.5 hours. So far we have not been able to solve this problem.

Our teams now simply assume the thing fails once in a while and have an automatic restart feature (i.e. a shell script with a while-true loop). The best guess at a root cause is https://issues.apache.org/jira/browse/HDFS-9276

If you have a real solution or a reference to a related bug report for this problem, then please share!

Niels Basjes |
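The restart wrapper Niels mentions can be as small as a shell loop. A sketch (the real job-submission command is site-specific; `JOB_CMD` is a placeholder, and the loop is bounded here only so the sketch terminates):

```shell
#!/usr/bin/env bash
# Restart wrapper: resubmit the job whenever it exits with a failure.
# JOB_CMD stands in for the real submission command; 'false' simulates
# a job that always fails. In production the loop would be unbounded.
JOB_CMD=${JOB_CMD:-false}
MAX_RESTARTS=${MAX_RESTARTS:-3}
restarts=0
while [ "$restarts" -lt "$MAX_RESTARTS" ]; do
  if $JOB_CMD; then
    break  # job finished cleanly; nothing to restart
  fi
  restarts=$((restarts + 1))
  echo "job exited; restart #$restarts"
done
echo "total restarts: $restarts"
```

A production version would also add a sleep between attempts so a fast-failing job does not hammer the cluster.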
Hi Niels,
Thanks for the feedback. As far as I know, Hadoop deliberately defaults to the one-week maximum lifetime for delegation tokens. Have you tried increasing the maximum token lifetime, or was that not an option?

I wonder why you use a while loop? Would it be possible to use the Yarn failover mechanism, which starts a new ApplicationMaster and resubmits the job?

Thanks,
Max |
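The Yarn failover Max refers to is governed by how many ApplicationMaster attempts are permitted. A sketch of the relevant settings (property names as documented for YARN and Flink on YARN around that era; verify them against the versions in use):

```xml
<!-- yarn-site.xml: cluster-wide cap on ApplicationMaster attempts -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>4</value>
</property>
```

```yaml
# flink-conf.yaml: attempts requested for this Flink session
# (must not exceed the cluster-wide cap above)
yarn.application-attempts: 4
```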
Hi
Has anything ever happened on this issue, and will it be addressed for 1.2? It's a blocker for us.

To quote the YARN security docs: "Any YARN service intended to run for an extended period of time must have a strategy for renewing credentials."
Reference: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html

Spark has this implemented as a thread in its application master that periodically renews delegation tokens with (1) HDFS and (2) YARN:
https://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_yarn_long_jobs.html |
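The renewal-thread approach described above boils down to running a relogin action on a fixed schedule, comfortably before each expiry. A minimal, Hadoop-free sketch of that scheduling skeleton (the `relogin` Runnable is a placeholder; in a real application master it would perform the keytab-based relogin or token renewal, which is not modeled here):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class RenewalScheduler {

    // Renew well before expiry (here at 80% of the ticket lifetime) so a
    // slow or failed first attempt still leaves time for a retry.
    static long renewalIntervalMillis(long ticketLifetimeMillis) {
        return (long) (ticketLifetimeMillis * 0.8);
    }

    // Periodically run the cluster-specific relogin/renewal action.
    static ScheduledFuture<?> scheduleRenewal(ScheduledExecutorService pool,
                                              Runnable relogin,
                                              long ticketLifetimeMillis) {
        long interval = renewalIntervalMillis(ticketLifetimeMillis);
        return pool.scheduleAtFixedRate(relogin, interval, interval,
                                        TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch renewed = new CountDownLatch(2);
        // Placeholder action: just counts invocations for this demo.
        scheduleRenewal(pool, renewed::countDown, 100L /* toy lifetime, ms */);
        renewed.await();  // wait until the action has fired twice
        pool.shutdownNow();
        System.out.println("renewals=" + (2 - renewed.getCount()));  // prints renewals=2
    }
}
```

The design point is simply that renewal must be driven by the credential's lifetime rather than by job activity, which is why it lives in a dedicated background thread.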
Niels, are you still facing this issue? As far as I understood it, the security changes in Flink 1.2.0 use a new Kerberos mechanism that allows infinite token renewal. |
Hi,

No, this issue is now gone for us. The fixes in 1.2.0 ensured that we are now able to run jobs on our cluster beyond the 7-day limit.

Niels
Best regards / Met vriendelijke groeten,
Niels Basjes |