Hi All,
I'm running Flink 1.6 on YARN. After the program has run for a day, it fails on YARN with the error log below. It seems to be caused by a timeout, but I have the following questions:

1. At which step did the communication between Flink components fail, and which two components are involved?
2. How can I solve this problem?

Thanks a lot!

java.lang.Exception: Cannot deploy task LeftOuterJoin(where: (=(id, article_id)), join: (id, created_time, article_score, PU, article_id, CU, CN)) -> select: (id, created_time, article_score, PU, CU, CN) (2/2) (d403002a7accc5133cf89a386ddc1dfb) - TaskManager (container_1532509321420_463249_01_000002 @ sh-bs-3-i1-hadoop-17-225 (dataPort=10459)) not responding after a rpcTimeout of 10000 ms
    at org.apache.flink.runtime.executiongraph.Execution.lambda$deploy$5(Execution.java:601) ~[flink-runtime_2.11-1.6.0.jar:1.6.0]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[na:1.8.0_65]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[na:1.8.0_65]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[na:1.8.0_65]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_65]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_65]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_65]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[na:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_65]
    at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_65]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@sh-bs-3-i1-hadoop-17-225:24213/user/taskmanager_0#-1762816591]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) ~[akka-actor_2.11-2.4.20.jar:na]
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126) ~[akka-actor_2.11-2.4.20.jar:na]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[scala-library-2.11.8.jar:na]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[scala-library-2.11.8.jar:na]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[scala-library-2.11.8.jar:na]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329) ~[akka-actor_2.11-2.4.20.jar:na]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280) ~[akka-actor_2.11-2.4.20.jar:na]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284) ~[akka-actor_2.11-2.4.20.jar:na]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236) ~[akka-actor_2.11-2.4.20.jar:na]
    ... 1 common frames omitted

Best,
Henry
Hi,

1. This RPC timeout occurs while the JobMaster is deploying a task to the TaskExecutor. The RPC thread in the TaskExecutor did not respond to the deployment message within 10 seconds. There are many possible causes, such as a network problem between the TaskExecutor and the JobMaster, or other time-consuming operations in the TaskExecutor. The root cause may be somewhat complicated to trace. First, check when the TaskExecutor receives this message, then check when it responds to it; you may also need to check what the RPC thread is doing in between.

2. As a temporary workaround, you can increase the RPC timeout parameter (akka.ask.timeout) above its default value.

Best,
Zhijiang
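
For point 2, a minimal sketch of the workaround as a flink-conf.yaml entry. The default of akka.ask.timeout in Flink 1.6 is 10 s, which matches the 10000 ms in the log above. The value 60 s below is only an illustrative assumption, not a recommended setting; choose one that fits how long your TaskExecutor can realistically take to respond. The cluster or YARN session has to be restarted for the change to take effect.

```yaml
# flink-conf.yaml
# Timeout for RPC "ask" calls such as the JobMaster's task deployment request.
# Default is 10 s; 60 s is an assumed example value, not a tuned recommendation.
akka.ask.timeout: 60 s
```

A longer timeout only hides the symptom. If the TaskExecutor is still slow to respond, the causes listed in point 1 are still worth investigating.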