(DEPRECATED) Apache Flink User Mailing List archive.

UI stability at high parallelism

Richard Moorhead

UI stability at high parallelism

When I submit a job to a Flink session with parallelism higher than 128, the job is submitted and renders in the UI, but when I view the job itself the UI starts to rapidly emit errors in the upper right:

Server Response:
Unable to load requested file /bad-request.

Is this a known issue? Is there a fix? Does this indicate underlying stability issues?

张光辉

Re: UI stability at high parallelism

We also encountered a similar issue internally. cc +huweihua.ckl

HuWeihua

Re: UI stability at high parallelism

Hi, Richard

Most likely the REST API has timed out. You can look for evidence of this in the JobManager log.

If you can share the full log, it will help us find the root cause.

Best

Weihua Hu

Richard Moorhead

Re: UI stability at high parallelism

2020-02-14 11:50:35,402 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Unhandled exception.
	akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#1293527273]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:235)
	at java.lang.Thread.run(Thread.java:748)

HuWeihua

Re: UI stability at high parallelism

These logs confirm that it is indeed a timeout issue. In our case, it was caused by task deployment taking a long time.

You can check in the log whether the time a task takes to go from SCHEDULED to DEPLOYING is greater than 10s. This step is processed in the main thread and blocks the processing of requests from the UI.
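One way to run that check is a small script over the JobManager log. This is only a rough sketch: the regular expression assumes each state change is logged as a timestamped line ending in "(<32-hex attempt id>) switched from <OLD> to <NEW>", which this thread does not confirm, and the function name and the 10-second default are illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumed log line shape (not confirmed in this thread):
# "<yyyy-MM-dd HH:mm:ss,SSS> ... (<32-hex attempt id>) switched from <OLD> to <NEW>."
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"\((?P<attempt>[0-9a-f]{32})\) switched from (?P<old>[A-Z]+) to (?P<new>[A-Z]+)"
)


def slow_deployments(lines, threshold=timedelta(seconds=10)):
    """Return (attempt_id, delay) pairs whose SCHEDULED -> DEPLOYING delay exceeds threshold."""
    scheduled_at = {}  # attempt id -> time it entered SCHEDULED
    slow = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # not a state-transition line
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        if m["new"] == "SCHEDULED":
            scheduled_at[m["attempt"]] = ts
        elif m["old"] == "SCHEDULED" and m["new"] == "DEPLOYING":
            start = scheduled_at.pop(m["attempt"], None)
            if start is not None and ts - start > threshold:
                slow.append((m["attempt"], ts - start))
    return slow
```

It could be called as `slow_deployments(open("jobmanager.log"))`, where the file path is a placeholder for wherever your JobManager log lives. Any attempts it returns took longer than the threshold to leave SCHEDULED.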

For now, you can increase 'akka.ask.timeout' to avoid this.
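As a minimal sketch of that workaround, assuming the session cluster reads its settings from flink-conf.yaml: the value of 60 s below is only an illustration, not a figure recommended in this thread, and should be sized to how long task deployment actually takes in your setup.

```yaml
# flink-conf.yaml
# Illustrative value only (an assumption); the trace above shows the ask timing out after 10000 ms.
akka.ask.timeout: 60 s
```

The cluster will most likely need a restart to pick up the new value.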

I have created a Jira issue to improve this: https://issues.apache.org/jira/browse/FLINK-16069

Best

Weihua Hu
