We are trying out Flink 1.7.0. We always get this exception when submitting a job with an external checkpoint via REST. Job parallelism is 1,600, and state size is probably in the range of 1-5 TB. The job is actually started; only the REST API returns this failure. If we submit the job without an external checkpoint, everything works fine.

Has anyone else seen this problem with 1.7? Appreciate your help!

Thanks,
Steven

org.apache.flink.runtime.rest.handler.RestHandlerException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-641142843]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$4(JarRunHandler.java:114)
    at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
    at akka.dispatch.OnComplete.internal(Future.scala:258)
    at akka.dispatch.OnComplete.internal(Future.scala:256)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
    at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-641142843]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
    at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
    ... 21 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-641142843]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    ... 9 more
|
We are also experiencing this! Thanks for speaking up! It's a relief to know we're not alone :)

We tried adding `akka.ask.timeout: 1 min` to our `flink-conf.yaml`, which did not seem to have any effect. I also tried raising every other related timeout (akka, rpc, etc.) and still encounter these errors. I believe they may also impact our ability to deploy, as we get a timeout when submitting the job programmatically.

I'd love to see a solution to this if one exists!

Best,
Aaron Levin

On Thu, Jan 10, 2019 at 2:58 PM Steven Wu <[hidden email]> wrote:
|
Hi all,

I think increasing the default value of the config option web.timeout [1] is what you are looking for.

Best,
Gary

[1] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/RestHandlerConfiguration.java#L76
[2] https://github.com/apache/flink/blob/a07ce7f6c88dc7d0c0d2ba55a0ab3f2283bf247c/flink-core/src/main/java/org/apache/flink/configuration/WebOptions.java#L177

On Thu, Jan 10, 2019 at 9:19 PM Aaron Levin <[hidden email]> wrote:
|
Gary, thanks a lot. web.timeout seems to help. Now I have run into a different issue with loading the checkpoint; I will take that up separately.

On Thu, Jan 10, 2019 at 12:25 PM Gary Yao <[hidden email]> wrote:
|
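For readers landing on this thread, here is a minimal sketch of the change Gary suggested and Steven reported as helpful. It assumes the option is given in milliseconds, which is consistent with the `10000 ms` shown in the exception. The value 60000 is purely illustrative; no one in the thread reported the value they actually used.

```yaml
# flink-conf.yaml
# Raise the REST handler timeout used when the web/REST layer asks the dispatcher.
# Assumption: value is in milliseconds; 60000 (1 minute) is an illustrative
# choice, not a value reported in this thread. The timeout seen in the
# stack trace above was 10000 ms.
web.timeout: 60000
```

The configuration file is read at startup, so the cluster entry point would presumably need a restart for the new value to apply. For a job restoring several terabytes of state, a larger value may be needed; that is a judgment call this thread does not settle.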