Flink application down due to RpcTimeout exception

徐涛
Hi All,
I'm running Flink 1.6 on YARN. After the program had run for a day, the job failed, and the error log is shown below.
It appears to be caused by a timeout, but I have the following questions:
1. At which step did communication between the Flink components fail? Which two components are involved?
2. How can this problem be solved?
Thanks a lot!

java.lang.Exception: Cannot deploy task LeftOuterJoin(where: (=(id, article_id)), join: (id, created_time, article_score, PU, article_id, CU, CN)) -> select: (id, created_time, article_score, PU, CU, CN) (2/2) (d403002a7accc5133cf89a386ddc1dfb) - TaskManager (container_1532509321420_463249_01_000002 @ sh-bs-3-i1-hadoop-17-225 (dataPort=10459)) not responding after a rpcTimeout of 10000 ms
	at org.apache.flink.runtime.executiongraph.Execution.lambda$deploy$5(Execution.java:601) ~[flink-runtime_2.11-1.6.0.jar:1.6.0]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[na:1.8.0_65]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[na:1.8.0_65]
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[na:1.8.0_65]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_65]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_65]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_65]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[na:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_65]
	at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_65]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@sh-bs-3-i1-hadoop-17-225:24213/user/taskmanager_0#-1762816591]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604) ~[akka-actor_2.11-2.4.20.jar:na]
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126) ~[akka-actor_2.11-2.4.20.jar:na]
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) ~[scala-library-2.11.8.jar:na]
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) ~[scala-library-2.11.8.jar:na]
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) ~[scala-library-2.11.8.jar:na]
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329) ~[akka-actor_2.11-2.4.20.jar:na]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280) ~[akka-actor_2.11-2.4.20.jar:na]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284) ~[akka-actor_2.11-2.4.20.jar:na]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236) ~[akka-actor_2.11-2.4.20.jar:na]
	... 1 common frames omitted


Best,
Henry

Re: Flink application down due to RpcTimeout exception

Zhijiang(wangzhijiang999)
Hi,

1. This RPC timeout occurs while the JobMaster is deploying the task to the TaskExecutor: the RPC thread in the TaskExecutor did not respond to the deployment message within 10 seconds. Many things can cause this, such as a network problem between the TaskExecutor and the JobMaster, or other time-consuming operations in the TaskExecutor. The root cause may be complicated to trace. First, check when the TaskExecutor receives this message, then check when it responds to it. You may also need to check what the RPC thread is doing in between.

2. As a temporary workaround, you can increase the RPC timeout parameter (akka.ask.timeout) above its default value.
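
For illustration, the timeout could be raised in flink-conf.yaml along these lines. The 60 s value is only an assumed example, not a value given in this thread; the default of 10 s corresponds to the "rpcTimeout of 10000 ms" reported in the trace.

```yaml
# flink-conf.yaml (sketch)
# Default is 10 s. The 60 s below is an illustrative assumption;
# choose a value that suits your cluster.
akka.ask.timeout: 60 s
```

A larger timeout only hides slow responses from the TaskExecutor, so it is still worth finding out why the deployment message was not answered in time.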

Best,
Zhijiang
------------------------------------------------------------------
From: 徐涛 <[hidden email]>
Sent: Thursday, September 13, 2018, 14:10
To: user <[hidden email]>
Subject: Flink application down due to RpcTimeout exception
