(DEPRECATED) Apache Flink User Mailing List archive.

akka.ask.timeout setting not honored


Greg Finch

akka.ask.timeout setting not honored

I'm having a problem with an Akka timeout when starting my cluster. The error is "Ask timed out after 10000 ms.". I have changed the akka.ask.timeout config setting to 300000 ms, but it still times out and fails after 10 seconds. I confirmed that the config is properly set both by checking the Job Manager configuration tab (it shows 300000 ms) and by logging the output of AkkaUtils.getTimeout(configuration), which also shows 300000 ms. It seems something is not honoring that configuration value.

I did find a different thread that discussed the fact that the LocalStreamEnvironment will not honor this setting, but that is not my case. I am running on a cluster (AWS EMR) using the regular StreamExecutionEnvironment. This is Flink 1.5.2.

Any ideas?

~~~~~

2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-139-66.ec2.internal:8041
2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-136-25.ec2.internal:8041
2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler  - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
	at java.lang.Thread.run(Thread.java:748)
2018-08-31 17:38:41 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Waiting for application to be successfully unregistered.
2018-08-31 17:38:41 INFO  o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Interrupted while waiting for queue
java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
	at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
2018-08-31 17:38:42 WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]

Gary Yao-2

Re: akka.ask.timeout setting not honored

Hi Greg,

Can you describe the steps to reproduce the problem, or can you attach the
full jobmanager logs? Because JobExecutionResultHandler appears in your log, I
assume that you are starting a job cluster on YARN. Without seeing the
complete logs, I cannot be sure what exactly happens. For now, you can try
setting the config option web.timeout to a higher value.

Best,
Gary


Greg Finch

Re: akka.ask.timeout setting not honored

Thanks Gary. Attached is the jobmanager log. You are correct that this is running on YARN. I changed web.timeout as you suggested, and it worked the few times I tested it. The problem is intermittent, though: sometimes the cluster starts before the timeout is hit. I'll keep the web.timeout setting and reply if the problem recurs. Thanks again for your quick response!


Attachment: jobmanager.out.txt (36K)

Greg Finch

Re: akka.ask.timeout setting not honored

Well ... that didn't take long. The next time I tried, I got the Akka timeout again. Attached are the logs from the last attempt. They're very similar to the other logs I sent.


Attachment: jobmanager.out.txt (32K)

Gary Yao-2

Re: akka.ask.timeout setting not honored

Hi Greg,

Unfortunately the environment information [1] is not logged. Can you set the
log level for all Flink packages to DEBUG?

Do you install Flink yourself on EMR, or do you use the pre-installed one?
Can you show us the command with which you start the cluster/submit the job?

I do not know if it is related but I found these warnings in your second log file:

    2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number
    2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number

Best,
Gary

[1] https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281


Greg Finch

Re: akka.ask.timeout setting not honored

Hi Gary,

Turns out, the configuration warning you mentioned was the key. akka.ask.timeout requires a duration with a unit, whereas web.timeout expects a plain long. So the web.timeout change I made earlier was never applied, because `300s` could not be parsed. Since correcting it (`web.timeout: 300000`), I have not been able to reproduce the error; everything starts successfully every time. I do have debug logging turned on for now. If it happens again in the next couple of days, I will send details with debug logs.

Thanks again for your help!

Greg
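
The mismatch described above can be illustrated with a small, self-contained Java sketch. It is a simplified model and not Flink's actual code: the class `TimeoutParsing` and both helper methods are hypothetical, the set of accepted units is an assumption, and the 10000 ms fallback is taken from the timeout reported in the stack trace. It shows why a duration string such as `300s` is fine for a duration-typed option but causes a long-typed option to fall back to its default, which would match the logged warning.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of Flink.
class TimeoutParsing {

    // Models a long-typed option (like web.timeout in this thread):
    // a value that is not a plain integer is rejected and the default is used.
    public static long parseLongOrDefault(String raw, long defaultValue) {
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            System.err.println("cannot evaluate value " + raw + " as a long integer number");
            return defaultValue;
        }
    }

    // Models a duration-typed option (like akka.ask.timeout in this thread):
    // a number followed by a unit, converted to milliseconds.
    // The supported units here (ms, s, min) are an assumption for the sketch.
    public static long parseDurationMillis(String raw) {
        String s = raw.trim().toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < s.length() && Character.isDigit(s.charAt(i))) {
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("no number in: " + raw);
        }
        long value = Long.parseLong(s.substring(0, i));
        String unit = s.substring(i).trim();
        switch (unit) {
            case "ms":
                return value;
            case "s":
                return TimeUnit.SECONDS.toMillis(value);
            case "min":
                return TimeUnit.MINUTES.toMillis(value);
            default:
                throw new IllegalArgumentException("unknown or missing unit in: " + raw);
        }
    }

    public static void main(String[] args) {
        long assumedDefault = 10_000L; // assumption: matches the 10000 ms in the error

        System.out.println(parseDurationMillis("300 s"));              // 300000
        System.out.println(parseLongOrDefault("300s", assumedDefault)); // 10000 (fallback)
        System.out.println(parseLongOrDefault("300000", assumedDefault)); // 300000
    }
}
```

In this model, the second call reproduces the symptom from the thread: the setting looks changed, but the effective timeout stays at the default because the value could not be read as a long.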
