I'm having a problem with an Akka timeout when starting my cluster. The error is "Ask timed out after 10000 ms.". I have changed the akka.ask.timeout config setting to 300000 ms, but it still times out and fails after 10 seconds. I confirmed that the config is properly set both by checking the Job Manager configuration tab (it shows 300000 ms) and by logging the output of AkkaUtils.getTimeout(configuration), which also shows 300000 ms. It seems something is not honoring that configuration value.
I did find a different thread that discussed the fact that the LocalStreamEnvironment will not honor this setting, but that is not my case. I am running on a cluster (AWS EMR) using the regular StreamExecutionEnvironment. This is Flink 1.5.2. Any ideas?

~~~~~
2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-139-66.ec2.internal:8041
2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-136-25.ec2.internal:8041
2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)
2018-08-31 17:38:41 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Waiting for application to be successfully unregistered.
2018-08-31 17:38:41 INFO o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Interrupted while waiting for queue
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
2018-08-31 17:38:42 WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://[hidden email]:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]
Hi Greg,

Can you describe the steps to reproduce the problem, or can you attach the full jobmanager logs? Because JobExecutionResultHandler appears in your log, I assume that you are starting a job cluster on YARN. Without seeing the complete logs, I cannot be sure what exactly happens. For now, you can try setting the config option web.timeout to a higher value.

Best,
Gary

On Fri, Aug 31, 2018 at 8:01 PM, Greg Finch <[hidden email]> wrote:
Thanks Gary. Attached is the jobmanager log. You are correct that this is running on YARN. I changed web.timeout as you suggested - that seems to be working the few times I tested it. This problem comes and goes though - sometimes it starts before it times out. I'll keep the web.timeout setting and reply again if the problem comes up again. Thanks again for your quick response! On Fri, Aug 31, 2018 at 1:38 PM Gary Yao <[hidden email]> wrote:
Attachment: jobmanager.out.txt (36K)
Well ... that didn't take long. The next time I tried, I got the Akka timeout again. Attached are the logs from the last attempt. They're very similar to the other logs I sent. On Fri, Aug 31, 2018 at 2:04 PM Greg Finch <[hidden email]> wrote:
Attachment: jobmanager.out.txt (32K)
Hi Greg,

Unfortunately the environment information [1] is not logged. Can you set the log level for all Flink packages to DEBUG? Do you install Flink yourself on EMR, or do you use the pre-installed one? Can you show us the command with which you start the cluster/submit the job?

I do not know if it is related, but I found these warnings in your second log file:

2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number
2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number

Best,
Gary

[1] https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281

On Fri, Aug 31, 2018 at 9:18 PM, Greg Finch <[hidden email]> wrote:
Hi Gary, Turns out, the configuration warning you mentioned was the key. The akka.ask.timeout requires a duration unit, but the web.timeout setting is looking for a long. So the change I made earlier would not have applied since it couldn't read `300s`. Since making that change (`web.timeout: 300000`), I have not been able to reproduce the error - everything starts successfully every time. I do have debug logging turned on for now. If it happens again in the next couple of days, I will send details with debug logs. Thanks again for your help! Greg On Fri, Aug 31, 2018 at 3:21 PM Gary Yao <[hidden email]> wrote:
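For anyone landing on this thread later, here is a minimal sketch of the two settings as they ended up. It assumes the values are set in flink-conf.yaml; the thread does not say how the configuration was applied on EMR, so treat the file name as an assumption. The values are the ones mentioned above.

```yaml
# Assumed location: flink-conf.yaml (not stated in the thread).

# akka.ask.timeout is parsed as a duration, so it needs a unit.
akka.ask.timeout: 300000 ms

# web.timeout is parsed as a long (milliseconds), so it must be a bare number.
# A value such as "300s" triggers the warning
# "Configuration cannot evaluate value 300s as a long integer number"
# and the setting is not applied.
web.timeout: 300000
```

If the 10000 ms ask timeout still shows up after this change, checking the jobmanager log for that Configuration warning is a quick way to see whether a value failed to parse.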