Hello,
I am frequently seeing this error in my jobmanager logs: 2019-11-18 09:07:08,863 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: kdi_g2 -> (Sink: s_id_metadata, Sink: t_id_metadata) (23/24) (5e10e88814fe4ab0be5f554ec59bd93d) switched from RUNNING to FAILED. java.lang.Exception: Container released on a *lost* node at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Yarn nodemanager logs don't show anything out of the ordinary when such exceptions occur, so I am suspecting it is a timeout of some sort due to network problems. (YARN is not killing the container ). It's difficult to diagnose because Flink doesn't give any reason for losing the node. Is it a timeout? OOM? Is there a corresponding configuration I should tune? What is the recommended course of action? |
Hi Amran, Did you monitor or have a look at your memory metrics(e.g. full GC) of your TM. There is a similar thread that a user reported the same question due to full GC, the link is here[1]. Best, Vino amran dean <[hidden email]> 于2019年11月22日周五 上午7:15写道:
|
Hi Amran, >> Container released on a *lost* node If you see such exceptions, it means that the corresponding Yarn NodeManager has lost. So all the containers running on this node will be released. The Flink YarnResourceManager receives the 'lost' message from Yarn ResourceManager and will allocate a new taskmanager container instead. If you want to find the root cause, you need to check Yarn NodeManager logs why it has been lost. Best, Yang vino yang <[hidden email]> 于2019年11月22日周五 上午10:20写道:
|
Free forum by Nabble | Edit this page |