UI stability at high parallelism

Richard Moorhead
When I submit a job to a Flink session with parallelism higher than 128, the job is submitted and renders in the UI, but when I view the job itself the UI starts rapidly emitting errors in the upper right:

Server Response:
Unable to load requested file /bad-request.

Is this a known issue? Is there a fix? Does this indicate underlying stability issues?

Re: UI stability at high parallelism

张光辉
We also encountered a similar issue internally. cc +huweihua.ckl 

On Thu, Feb 13, 2020 at 9:40 AM, Richard Moorhead <[hidden email]> wrote:
When I submit a job to a Flink session with parallelism higher than 128, the job is submitted and renders in the UI, but when I view the job itself the UI starts rapidly emitting errors in the upper right:

Server Response:
Unable to load requested file /bad-request.

Is this a known issue? Is there a fix? Does this indicate underlying stability issues?

Re: UI stability at high parallelism

HuWeihua
In reply to this post by Richard Moorhead
Hi Richard,

Most likely the REST API has timed out. You can look for evidence in the JobManager log.

If you can share the full log, it will help us find the root cause.


Best
Weihua Hu

Re: UI stability at high parallelism

Richard Moorhead
2020-02-14 11:50:35,402 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#1293527273]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
    at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
    at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
    at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
    at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
    at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:235)
    at java.lang.Thread.run(Thread.java:748)

On Wed, Feb 12, 2020 at 11:30 PM HuWeihua <[hidden email]> wrote:
Hi Richard,

Most likely the REST API has timed out. You can look for evidence in the JobManager log.

If you can share the full log, it will help us find the root cause.


Best
Weihua Hu

Re: UI stability at high parallelism

HuWeihua
These logs confirm that it is indeed a timeout issue. In our scenario, it happened because task deployment took a long time.
You can check in the log whether a task takes more than 10 s to go from SCHEDULED to DEPLOYING. This step is processed in the main thread and blocks the handling of requests from the UI.
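
The check described above can be sketched in Python roughly as follows. This is only a minimal sketch, not an official tool: it assumes the JobManager log contains ExecutionGraph lines of the form "<timestamp> ... (<32-hex attempt id>) switched from SCHEDULED to DEPLOYING." with the same timestamp format as the log excerpt in this thread. The function name `slow_deployments` and the regular expression are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line shape (not confirmed in this thread):
# "<timestamp> ... <task> (i/n) (<attempt id>) switched from A to B."
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"\((?P<attempt>[0-9a-f]{32})\) switched from (?P<src>[A-Z]+) to (?P<dst>[A-Z]+)"
)


def slow_deployments(lines, threshold_s=10.0):
    """Return (attempt_id, seconds) pairs whose SCHEDULED -> DEPLOYING gap exceeds threshold_s."""
    scheduled_at = {}  # attempt id -> time it entered SCHEDULED
    slow = []
    for line in lines:
        m = TRANSITION.match(line)
        if not m:
            continue  # stack traces and unrelated log lines are skipped
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        if m["dst"] == "SCHEDULED":
            scheduled_at[m["attempt"]] = ts
        elif m["src"] == "SCHEDULED" and m["dst"] == "DEPLOYING":
            start = scheduled_at.get(m["attempt"])
            if start is not None:
                gap = (ts - start).total_seconds()
                if gap > threshold_s:
                    slow.append((m["attempt"], gap))
    return slow
```

To try it, open the JobManager log file and pass the file object to `slow_deployments`. A non-empty result would suggest that deployment is keeping the main thread busy for longer than the 10 s ask timeout.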

For now, you can increase 'akka.ask.timeout' to avoid this.
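
As a rough illustration, the setting would go into flink-conf.yaml along these lines. The 60 s value is only an assumed example, not a recommendation from this thread. The "[10000 ms]" in the exception corresponds to the 10 s default, and the cluster most likely needs a restart to pick up the change.

```yaml
# flink-conf.yaml
# Illustrative value only; the default is 10 s.
akka.ask.timeout: 60 s
```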

I have created a Jira issue to improve this: https://issues.apache.org/jira/browse/FLINK-16069

Best
Weihua Hu

On Feb 15, 2020, at 01:54, Richard Moorhead <[hidden email]> wrote:

2020-02-14 11:50:35,402 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#1293527273]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
    at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
    at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
    at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
    at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
    at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
    at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
    at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:235)
    at java.lang.Thread.run(Thread.java:748)
