(DEPRECATED) Apache Flink User Mailing List archive.

1.6 UI issues

Juan Gentile

1.6 UI issues

Hello!

We are migrating to the latest 1.6 version and all the jobs seem to work fine, but when we check individual jobs through the web interface, clicking on a job either takes too long to load the job's information or never loads it at all.

Has anyone had this issue? Any clues as to why?

Thank you,

Juan

Yun Tang

Re: 1.6 UI issues

Hi Juan

In our experience, you could first check the jobmanager.log to see whether it contains log lines similar to the one below:

max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes

If you see such lines, you should increase akka.framesize to a larger value (the default is '10485760b') [1].

Otherwise, you could check the GC log of the job manager to see whether the GC overhead is too heavy; if so, consider increasing the job manager's memory.
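
A minimal Python sketch of the log check suggested above. The regular expression is modeled on the line quoted in this message; the exact wording in a real jobmanager.log is an assumption. The helper name find_oversized_messages and the sample size 250000 are hypothetical.

```python
import re

# Pattern modeled on the log line quoted in this message (an assumption
# about the real log format, which may differ between versions).
FRAME_PATTERN = re.compile(
    r"max allowed size (\d+) bytes, "
    r"actual size of encoded class (\S+) was (\d+) bytes"
)


def find_oversized_messages(lines):
    """Return (max_allowed, class_name, actual_size) for each matching line."""
    hits = []
    for line in lines:
        match = FRAME_PATTERN.search(line)
        if match:
            hits.append(
                (int(match.group(1)), match.group(2), int(match.group(3)))
            )
    return hits


# Hypothetical sample lines; in practice, iterate over the open log file.
sample = [
    "2018-11-01 12:00:00,000 INFO  some.Logger - unrelated line",
    "max allowed size 128000 bytes, actual size of encoded class "
    "akka.actor.Status$Success was 250000 bytes",
]
hits = find_oversized_messages(sample)
print(hits)
```

Any hit suggests that a message exceeded the frame size limit, which is the case where raising akka.framesize is worth trying.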

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Best

Yun Tang

From: Juan Gentile <[hidden email]>
Sent: Wednesday, October 31, 2018 22:05
To: [hidden email]
Subject: 1.6 UI issues

Hello!

Has anyone had this issue? Any clues as to why?

Thank you,

Juan

Juan Gentile

Re: 1.6 UI issues

Hello Yun,

We haven’t seen the error in the log as you mentioned. We also checked the GC and it seems to be okay. Inspecting the UI we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]}

We suspect we are running into the same issue as described here (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html) but we are not so sure.

Have you encountered this issue before?

Thank you,

From: Yun Tang <[hidden email]>
Date: Thursday, 1 November 2018 at 12:31
To: Juan Gentile <[hidden email]>, "[hidden email]" <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Juan

In our experience, you could first check the jobmanager.log to see whether it contains log lines similar to the one below:

max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes

If you see such lines, you should increase akka.framesize to a larger value (the default is '10485760b') [1].

Otherwise, you could check the GC log of the job manager to see whether the GC overhead is too heavy; if so, consider increasing the job manager's memory.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Best

Yun Tang

Dawid Wysakowicz-2

Re: 1.6 UI issues

Hi Juan,

To me, it doesn't look similar to the linked issue. What cluster setup are you using? Are you running in HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid

On 02/11/2018 17:26, Juan Gentile wrote:

Hello Yun,

We haven’t seen the error in the log as you mentioned. We also checked the GC and it seems to be okay. Inspecting the UI we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]}

We suspect we are running into the same issue as described here (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html) but we are not so sure.

Have you encountered this issue before?

Thank you,

Till Rohrmann

Re: 1.6 UI issues

Hi Juan,

could you share the cluster entrypoint logs with us? They should contain more information about the internal server error.

Just to make sure, you are using Flink 1.6.2, right?

Cheers,

Till

On Thu, Nov 8, 2018 at 3:29 PM Dawid Wysakowicz <[hidden email]> wrote:

Hi Juan,

To me, it doesn't look similar to the linked issue. What cluster setup are you using? Are you running in HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid

Oleksandr Nitavskyi

Re: 1.6 UI issues

Hello guys. Happy new year!

Context: we started having trouble with the UI after bumping our Flink version from 1.4 to 1.6.3. The UI couldn’t render the Job details page, so inspecting jobs became impossible for us with the new version.

It looks like we have a workaround for our UI issue.

After some investigation we realized that, starting with Flink 1.5, we were hitting a timeout on the actor call restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. We increased the web.timeout parameter, and the timeout exceptions on the JobManager side stopped.

Also, for SingleJobController on the AngularJS side, we needed to tweak web.refresh-interval to ensure that the front end waits for the back-end request to finish. Otherwise the AngularJS side can make another request in SingleJobController and, for reasons we don't yet understand, the UI is not updated when the older request finishes. We will take a closer look at this behavior.

Does this ring a bell for you?

Thank you

Kind Regards

Oleksandr

From: Till Rohrmann <[hidden email]>
Date: Wednesday 19 December 2018 at 16:52
To: Juan Gentile <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, Jeff Bean <[hidden email]>, Oleksandr Nitavskyi <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Juan,

thanks for the log. The log file does not contain anything suspicious. Are you sure that you sent me the right file? The timestamps don't seem to match. In the attached log, the job seems to run without problems.

Cheers,

Till

On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <[hidden email]> wrote:

Hello Till, Dawid

Sorry for the late response on this issue and thank you Jeff for helping us with this.

Yes we are using 1.6.2

I attach the logs from the Job Master.

Also we noticed this exception:

2018-12-19 08:50:10,497 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Implementation error: Unhandled exception.
java.util.concurrent.CancellationException
    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)
    at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)
    at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)
    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)
    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)
    at org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

2018-12-19 08:50:17,977 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)

Because of this, we tested with the parameter -Dakka.ask.timeout=60s

But the issue remains.
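
The timeout that actually applied can be read from the exception message itself, which helps when checking whether a changed setting shows up in later log lines. A minimal Python sketch, assuming the message format matches the AskTimeoutException line quoted above; parse_ask_timeout is a hypothetical helper, not part of Flink or Akka.

```python
import re

# Modeled on the AskTimeoutException message quoted above; the format is
# an assumption and may vary between Akka versions.
ASK_TIMEOUT = re.compile(
    r"Ask timed out on \[Actor\[(?P<actor>[^\]]+)\]\] "
    r"after \[(?P<millis>\d+) ms\]"
)


def parse_ask_timeout(line):
    """Return (actor_path, timeout_ms), or None if the line does not match."""
    match = ASK_TIMEOUT.search(line)
    if match is None:
        return None
    return match.group("actor"), int(match.group("millis"))


line = (
    "akka.pattern.AskTimeoutException: Ask timed out on "
    "[Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. "
    "Sender[null] sent message of type "
    '"org.apache.flink.runtime.rpc.messages.LocalFencedMessage".'
)
result = parse_ask_timeout(line)
print(result)
```

If the reported value stays at 10000 ms after a setting has been changed, that setting probably does not govern this particular call.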

Thank you

Juan

From: Till Rohrmann <[hidden email]>
Date: Thursday, 8 November 2018 at 16:06
To: "[hidden email]" <[hidden email]>
Cc: Juan Gentile <[hidden email]>, "[hidden email]" <[hidden email]>, user <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Juan,

could you share the cluster entrypoint logs with us? They should contain more information about the internal server error.

Just to make sure, you are using Flink 1.6.2, right?

Cheers,

Till

Till Rohrmann

Re: 1.6 UI issues

Hi Oleksandr,

the requestJob call should only take longer if either the `JobMaster` is overloaded and too busy to respond to the request or if the ArchivedExecutionGraph is very large (e.g. very large accumulators) and generating it and sending it over to the RestServerEndpoint takes too long. This is also the change which was introduced with Flink 1.5. Instead of simply handing over a reference to the RestServerEndpoint from the JobMaster, the ArchivedExecutionGraph now needs to be sent through the network stack to the RestServerEndpoint.

If you did not change the akka.framesize then the maximum size of the ArchivedExecutionGraph should only be 10 MB, though. Therefore, I would guess that your `JobMaster` must be quite busy if the requests time out.
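
The 10 MB figure follows from the default akka.framesize value quoted earlier in the thread ('10485760b'). A small Python check of that arithmetic; the helper framesize_bytes is hypothetical and only handles the '<digits>b' form shown in this thread.

```python
FRAMESIZE_DEFAULT = "10485760b"  # default value quoted earlier in the thread


def framesize_bytes(value):
    """Parse a value of the form '<digits>b' into an integer byte count."""
    if not value.endswith("b"):
        raise ValueError("expected a value ending in 'b': %r" % value)
    return int(value[:-1])


size = framesize_bytes(FRAMESIZE_DEFAULT)
mebibytes = size / (1024 * 1024)
print(size, mebibytes)
```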

Cheers,

Till

On Wed, Jan 2, 2019 at 10:58 AM Oleksandr Nitavskyi <[hidden email]> wrote:

Hello guys. Happy new year!

Context: we started having trouble with the UI after bumping our Flink version from 1.4 to 1.6.3. The UI couldn’t render the Job details page, so inspecting jobs became impossible for us with the new version.

It looks like we have a workaround for our UI issue.

After some investigation we realized that, starting with Flink 1.5, we were hitting a timeout on the actor call restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. We increased the web.timeout parameter, and the timeout exceptions on the JobManager side stopped.

Also, for SingleJobController on the AngularJS side, we needed to tweak web.refresh-interval to ensure that the front end waits for the back-end request to finish. Otherwise the AngularJS side can make another request in SingleJobController and, for reasons we don't yet understand, the UI is not updated when the older request finishes. We will take a closer look at this behavior.

Does this ring a bell for you?

Thank you

Kind Regards

Oleksandr

Oleksandr Nitavskyi

Re: 1.6 UI issues

Hello Till,

First, congratulations to you and the whole Flink community! It is great to see such success and recognition of Apache Flink and your work.

Thanks also for the previous answer and the good tips. On our side we have taken several more steps toward understanding the issue.

I think we have two related problems in Flink, which can be reproduced in our setup:

UI issue

It looks like there are some routing problems on the Angular side of the Flink UI. Angular refreshes the job state (which is 20 kb in our case) every 10 sec by default (web.refresh-interval).

If one of the refresh calls takes longer than web.refresh-interval, the next request is made anyway.

After a while the first requests start to complete, but the UI is not rendered correctly in this case.

Only the tab names are shown; no graph is drawn and no metrics are requested or rendered. What do you think about me creating a Jira bug for this issue?
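
The pile-up described above can be illustrated with a small Python model. This is a simplified sketch, not Flink's or Angular's actual behavior: it assumes the UI sends a request at every refresh tick regardless of earlier ones, and that each request takes a fixed time. The function in_flight_requests and the 25 s and 5 s response times are hypothetical.

```python
def in_flight_requests(refresh_interval_s, response_time_s, horizon_s):
    """Count unanswered requests at each refresh tick.

    Simplified model (an assumption): one request is sent at
    t = 0, interval, 2 * interval, ... and every request takes exactly
    response_time_s seconds. All arguments are whole seconds.
    """
    counts = []
    for tick in range(0, horizon_s + 1, refresh_interval_s):
        sent_times = range(0, tick + 1, refresh_interval_s)
        outstanding = sum(
            1 for sent in sent_times if sent + response_time_s > tick
        )
        counts.append((tick, outstanding))
    return counts


# Hypothetical numbers: 10 s refresh interval, back end needs 25 s or 5 s.
slow = in_flight_requests(10, 25, 50)
fast = in_flight_requests(10, 5, 50)
print(slow)
print(fast)
```

In this model a response time below the refresh interval keeps exactly one request outstanding at each tick, while a slower back end accumulates overlapping requests.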

The second issue is the reason why we observe this behavior. After some profiling in JVisualVM and JMC, it looks like the hot spot for us is adding metrics to the HashMap.

In the tested setup we had 60 Task Managers, and on every Task Manager we get 6114 metrics (operators * number of metrics), which results in 366840 inserts per 10 seconds, or about 36k inserts per second. The problem is that, with a small refresh interval, the many requests from the UI effectively DDoS the back end in our case.
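
The insert rate quoted above can be reproduced with a few lines of Python. The numbers are the ones given in this message; the variable names are only illustrative.

```python
task_managers = 60               # from the message above
metrics_per_task_manager = 6114  # from the message above
window_s = 10                    # the 10-second period quoted above

inserts_per_window = task_managers * metrics_per_task_manager
inserts_per_second = inserts_per_window / window_s
print(inserts_per_window, inserts_per_second)
```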

If you think it is interesting, I can share the profiler snapshots with you. The most interesting part is the hot methods:

Stack Trace | Sample Count | Percentage(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump) | 709 | 79.395
java.util.concurrent.ConcurrentHashMap.putVal(Object, Object, boolean) | 595 | 66.629
sun.misc.FloatingDecimal.toJavaFormatString(double) | 89 | 9.966

A lot of CPU is also spent on new-generation GC, again caused by the addMetric method:

Stack Trace | TLABs | Total TLAB Size(bytes) | Pressure(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump) | 2,537 | 1,791,614,312 | 61.372

Increasing the interval in MetricFetcher#update (by recompiling Flink) improves UI responsiveness.

We are also using the G1 garbage collector for our Job Manager, which has 8 GB of heap. We have noticed that young GC takes a very significant amount of time, especially during the Scan RS phase. Is there any recommendation from the community about which GC algorithm we should use for the JobManager (and TaskManager)?

Thank you

Kind Regards

Oleksandr

From: Till Rohrmann <[hidden email]>
Date: Wednesday 2 January 2019 at 14:34
To: Oleksandr Nitavskyi <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, "[hidden email]" <[hidden email]>, Jeff Bean <[hidden email]>, Jérôme Viveret <[hidden email]>, Juan Gentile <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Oleksandr,

Cheers,

Till

On Wed, Jan 2, 2019 at 10:58 AM Oleksandr Nitavskyi <[hidden email]> wrote:

Hello guys. Happy new year!

Context: we started to have some troubles with UI after bumping our Flink version from 1.4 to 1.6.3. UI couldn’t render Job details page, so inspecting of the jobs for us has become impossible with the new version.

And looks like we have a workaround for our UI issue.

After some investigation we realized that starting from Flink 1.5 version we started to have a timeout on the actor call: restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. So we have increased web.timeout parameter and we have stopped to have timeout exception on the JobManager side.

Also in SingleJobController on the Angular JS side we needed to tweak web.refresh-interval in order to ensure that Front-End is waiting for back-end request to be finished. Otherwise Angular JS side can make another request in SingleJobController and don’t know why when older request is finished no UI has been changed. We will have a look closer on this behavior.

Does it ring a bell for you probably?

Thank you

Kind Regards

Oleksandr

From: Till Rohrmann <[hidden email]>
Date: Wednesday 19 December 2018 at 16:52
To: Juan Gentile <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, Jeff Bean <[hidden email]>, Oleksandr Nitavskyi <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Juan,

thanks for the log. The log file does not contain anything suspicious. Are you sure that you sent me the right file? The timestamps don't seem to match. In the attached log, the job seems to run without problems.

Cheers,

Till
On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <[hidden email]> wrote:
Hello Till, Dawid

Sorry for the late response on this issue and thank you Jeff for helping us with this.

Yes we are using 1.6.2

I attach the logs from the Job Master.

Also we noticed this exception:

2018-12-19 08:50:10,497 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Implementation error: Unhandled exception.

java.util.concurrent.CancellationException

    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)

    at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)

    at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)

    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)

    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)

    at org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)

    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)

    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

    at java.lang.Thread.run(Thread.java:748)

2018-12-19 08:50:17,977 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Implementation error: Unhandled exception.

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

    at java.lang.Thread.run(Thread.java:748)

For this we tested with the parameter -Dakka.ask.timeout=60s, but the issue remains.

Thank you

Juan

From: Till Rohrmann <[hidden email]>
Date: Thursday, 8 November 2018 at 16:06
To: "[hidden email]" <[hidden email]>
Cc: Juan Gentile <[hidden email]>, "[hidden email]" <[hidden email]>, user <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Juan,

could you share the cluster entrypoint logs with us? They should contain more information about the internal server error.

Just to make sure, you are using Flink 1.6.2, right?

Cheers,

Till
On Thu, Nov 8, 2018 at 3:29 PM Dawid Wysakowicz <[hidden email]> wrote:
Hi Juan,

To me, it doesn't look similar to the linked issue. What cluster setup are you using? Are you running in HA mode?

I am adding Till to cc, who might be able to help you more.

Best,

Dawid

On 02/11/2018 17:26, Juan Gentile wrote:
Hello Yun,

We haven’t seen the error in the log as you mentioned. We also checked the GC and it seems to be okay. Inspecting the UI we found the following error:

{"errors":["Could not retrieve the redirect address of the current leader. Please try to refresh."]}

We suspect we are running into the same issue as described here (http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/akka-timeout-td14996.html) but we are not so sure.

Have you encountered this issue before?

Thank you,

From: Yun Tang [hidden email]
Date: Thursday, 1 November 2018 at 12:31
To: Juan Gentile [hidden email], [hidden email] [hidden email]
Subject: Re: 1.6 UI issues

Hi Juan

From our experience, you could first check jobmanager.log for log lines similar to the following:
max allowed size 128000 bytes, actual size of encoded class akka.actor.Status$Success was xxx bytes

If you see such lines, you should increase akka.framesize to a larger value (the default is '10485760b') [1].

Otherwise, you could check the GC log of the JobManager to see whether the GC overhead is too heavy; if so, consider increasing the JobManager's memory.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#distributed-coordination-via-akka

Best

Yun Tang

From: Juan Gentile [hidden email]
Sent: Wednesday, October 31, 2018 22:05
To: [hidden email]
Subject: 1.6 UI issues

Hello!

We are migrating to the latest 1.6 version and all the jobs seem to work fine, but when we check individual jobs through the web interface, clicking on a job either takes too long to load the job information or never loads it at all.

Has anyone had this issue? Any clues as to why?

Thank you,

Juan

Till Rohrmann

Re: 1.6 UI issues

Hi Oleksandr,

thanks a lot for the kind wishes and for the detailed investigation.

1. I think if the cluster cannot serve the information within the web.refresh-interval, it would be best to increase it. I quickly looked into the `ExecutionGraphCache`, which is used for storing the `ArchivedExecutionGraph`, and it looks like one could change the logic a bit. What we do at the moment is to invalidate the ExecutionGraph cache entries after the web.refresh-interval and request an update from the cluster. This has the benefit (given that the response is fast) that we see the updated state sooner. Instead, one could invalidate the old ExecutionGraph cache entry only after the response to the new request has arrived. This would prevent your situation because you would keep the old state as long as the request is in flight. The downside of this approach is that you might wait another UI refresh interval before you see the results, even if the response is very fast. You could open a JIRA issue to discuss this further.

2. The high load caused by the MetricStore is indeed a problem. For that we should also open a JIRA issue to investigate what we could improve here. One thing we should definitely do is to make the fetching interval configurable so that one doesn't have to recompile Flink in order to change it. I actually quickly added it [1,2].
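The alternative invalidation strategy from point 1 can be sketched roughly as follows. This is only an illustration of the idea and not the code of Flink's `ExecutionGraphCache`: the class name `StaleWhileRefreshCache`, the `loader` function and the explicit `nowMs` argument are all assumptions made for this example. An expired entry stays visible until the response to the new request has arrived, and at most one request per key is in flight.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical cache (not Flink's ExecutionGraphCache): serves the old value while a refresh is in flight. */
final class StaleWhileRefreshCache<K, V> {

    /** A cached value together with the time at which its request was started. */
    private static final class Entry<V> {
        final V value;
        final long requestedAtMs;

        Entry(V value, long requestedAtMs) {
            this.value = value;
            this.requestedAtMs = requestedAtMs;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final Map<K, CompletableFuture<V>> inFlight = new HashMap<>();
    private final long timeToLiveMs;
    private final Function<K, CompletableFuture<V>> loader; // assumed asynchronous fetch, e.g. an RPC

    StaleWhileRefreshCache(long timeToLiveMs, Function<K, CompletableFuture<V>> loader) {
        this.timeToLiveMs = timeToLiveMs;
        this.loader = loader;
    }

    /** Returns a fresh value, a stale value (while a refresh runs), or the pending first request. */
    synchronized CompletableFuture<V> get(K key, long nowMs) {
        Entry<V> cached = entries.get(key);
        if (cached != null && nowMs - cached.requestedAtMs < timeToLiveMs) {
            return CompletableFuture.completedFuture(cached.value); // still fresh
        }
        // Missing or expired: start a refresh unless one is already running for this key.
        CompletableFuture<V> request = inFlight.get(key);
        if (request == null) {
            CompletableFuture<V> started = loader.apply(key);
            inFlight.put(key, started);
            started.whenComplete((value, error) -> onResponse(key, started, value, error, nowMs));
            request = started;
        }
        // Keep showing the old state until the new response has arrived.
        return cached != null ? CompletableFuture.completedFuture(cached.value) : request;
    }

    /** Replaces the old entry only once a successful response is available. */
    private synchronized void onResponse(
            K key, CompletableFuture<V> request, V value, Throwable error, long requestedAtMs) {
        inFlight.remove(key, request);
        if (error == null) {
            entries.put(key, new Entry<>(value, requestedAtMs));
        }
    }
}
```

With this shape, a slow request never leaves the caller without data once a first value has been cached. The trade-off is the one described above: a caller that hits an expired entry gets the old state and only sees the update on its next call.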

Thanks a lot for your help with debugging the problems!

[1] https://github.com/apache/flink/pull/7459

[2] https://issues.apache.org/jira/browse/FLINK-11300

Cheers,

Till

On Thu, Jan 10, 2019 at 10:08 AM Oleksandr Nitavskyi <[hidden email]> wrote:

Hello Till,

First, congratulations to you and the whole Flink community! It is great to see such success and recognition of Apache Flink and your work.

Thanks also for the previous answer and the good tips. On our side we have made several more steps towards understanding the issue.

I think we have two related problems in Flink, which can be reproduced in our setup:

UI issue

There appear to be some routing problems on the Angular side of the Flink UI. Angular refreshes the job state (which is 20 KB in our case) every 10 seconds by default (web.refresh-interval).

If one of the refresh calls takes longer than web.refresh-interval, the next request is sent anyway.

After a while the first requests start to complete, but the UI is not rendered correctly in this case.

Only the tab names are shown: no graph is rendered, and no metrics are requested or rendered. What do you think about me creating a Jira bug for this issue?
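One way to avoid overlapping refresh calls is to skip a refresh tick while the previous request is still pending. The Flink UI itself is Angular code, so the Java sketch below only illustrates the idea; `NonOverlappingRefresher`, `tick()` and `latest()` are hypothetical names, not part of Flink.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical poller: a refresh tick is skipped while the previous request is still pending. */
final class NonOverlappingRefresher<T> {

    private final Supplier<CompletableFuture<T>> request; // assumed asynchronous call to the back end
    private final AtomicBoolean inFlight = new AtomicBoolean(false);
    private volatile T latest;

    NonOverlappingRefresher(Supplier<CompletableFuture<T>> request) {
        this.request = request;
    }

    /** Called once per refresh interval; returns false if the tick was skipped. */
    boolean tick() {
        if (!inFlight.compareAndSet(false, true)) {
            return false; // previous request has not completed yet
        }
        try {
            request.get().whenComplete((value, error) -> {
                if (error == null) {
                    latest = value; // only successful responses replace the shown state
                }
                inFlight.set(false);
            });
        } catch (RuntimeException e) {
            inFlight.set(false); // the request could not even be started
            throw e;
        }
        return true;
    }

    /** Last successfully fetched state, or null if none has arrived yet. */
    T latest() {
        return latest;
    }
}
```

A scheduler would call `tick()` every refresh interval. When the back end needs longer than the interval, the effective polling rate drops to the response time instead of piling up requests.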

The second issue is the reason why we observe this behavior. After some profiling with JVisualVM and JMC, it looks like the hot spot for us is adding metrics to the HashMap.

In the tested setup we had 60 TaskManagers, and each TaskManager reports 6114 metrics (operators * metrics per operator). That results in 366,840 inserts per 10 seconds, i.e. roughly 36.7k inserts per second. The problem is that, with a small refresh interval, the many requests from the UI effectively DDoS the back-end system.
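The arithmetic behind these numbers can be reproduced with a few lines of Java. The class and method names are made up for this example, and the 30-second interval in the last line is only a hypothetical value that shows how the insert rate scales with the interval.

```java
/** Worked arithmetic for the metric insert rate quoted above (names are illustrative only). */
final class MetricInsertRate {

    /** Number of map inserts caused by one fetch round across all TaskManagers. */
    static long insertsPerInterval(int taskManagers, int metricsPerTaskManager) {
        return (long) taskManagers * metricsPerTaskManager;
    }

    /** Average insert rate when one fetch round happens every intervalSeconds. */
    static double insertsPerSecond(long insertsPerInterval, double intervalSeconds) {
        return insertsPerInterval / intervalSeconds;
    }

    public static void main(String[] args) {
        long perInterval = insertsPerInterval(60, 6114);
        System.out.println(perInterval);                         // prints 366840
        System.out.println(insertsPerSecond(perInterval, 10.0)); // prints 36684.0
        // Hypothetical longer interval: tripling it cuts the rate to a third.
        System.out.println(insertsPerSecond(perInterval, 30.0)); // prints 12228.0
    }
}
```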

If you think it is interesting, I can share the profiler snapshots with you. The most interesting part is the hot methods:

Stack Trace                                                                                           Sample Count   Percentage(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump)   709            79.395
java.util.concurrent.ConcurrentHashMap.putVal(Object, Object, boolean)                                595            66.629
sun.misc.FloatingDecimal.toJavaFormatString(double)                                                   89             9.966

A lot of CPU is also spent on young-generation GC, again caused by the addMetric method:

Stack Trace                                                                                           TLABs   Total TLAB Size(bytes)   Pressure(%)
org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addMetric(Map, String, MetricDump)   2,537   1,791,614,312            61.372

Increasing the interval in MetricFetcher#update by recompiling Flink improves UI responsiveness.

We are also using the G1 garbage collector for our JobManager, which has 8 GB of heap. We have noticed that young GC takes a very significant amount of time, especially during the Scan RS phase. Is there any recommendation from the community about which GC algorithm we should use for the JobManager (and TaskManager)?

Thank you

Kind Regards

Oleksandr

From: Till Rohrmann <[hidden email]>
Date: Wednesday 2 January 2019 at 14:34
To: Oleksandr Nitavskyi <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, "[hidden email]" <[hidden email]>, Jeff Bean <[hidden email]>, Jérôme Viveret <[hidden email]>, Juan Gentile <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Oleksandr,

the requestJob call should only take longer if either the `JobMaster` is overloaded and too busy to respond to the request or if the ArchivedExecutionGraph is very large (e.g. very large accumulators) and generating it and sending it over to the RestServerEndpoint takes too long. This is also the change which was introduced with Flink 1.5. Instead of simply handing over a reference to the RestServerEndpoint from the JobMaster, the ArchivedExecutionGraph now needs to be sent through the network stack to the RestServerEndpoint.

If you did not change the akka.framesize then the maximum size of the ArchivedExecutionGraph should only be 10 MB, though. Therefore, I would guess that your `JobMaster` must be quite busy if the requests time out.

Cheers,

Till
On Wed, Jan 2, 2019 at 10:58 AM Oleksandr Nitavskyi <[hidden email]> wrote:
Hello guys. Happy new year!

Context: we started to have trouble with the UI after bumping our Flink version from 1.4 to 1.6.3. The UI couldn’t render the job details page, so inspecting jobs became impossible for us with the new version.

It looks like we have a workaround for our UI issue.

After some investigation we realized that, starting with Flink 1.5, we were hitting a timeout on the actor call restfulGateway.requestJob(jobId, timeout) in ExecutionGraphCache. We increased the web.timeout parameter, and the timeout exceptions on the JobManager side stopped.

Also, for SingleJobController on the AngularJS side, we needed to tweak web.refresh-interval to ensure that the front end waits for the back-end request to finish. Otherwise the AngularJS side can issue another request in SingleJobController and, for a reason we don't yet understand, the UI is not updated when the older request finishes. We will take a closer look at this behavior.

Does this ring a bell for you?

Thank you

Kind Regards

Oleksandr

From: Till Rohrmann <[hidden email]>
Date: Wednesday 19 December 2018 at 16:52
To: Juan Gentile <[hidden email]>
Cc: "[hidden email]" <[hidden email]>, Jeff Bean <[hidden email]>, Oleksandr Nitavskyi <[hidden email]>
Subject: Re: 1.6 UI issues

Hi Juan,

thanks for the log. The log file does not contain anything suspicious. Are you sure that you sent me the right file? The timestamps don't seem to match. In the attached log, the job seems to run without problems.

Cheers,

Till
On Wed, Dec 19, 2018 at 10:26 AM Juan Gentile <[hidden email]> wrote:
Hello Till, Dawid

Sorry for the late response on this issue and thank you Jeff for helping us with this.

Yes, we are using 1.6.2.

I have attached the logs from the Job Master.

Also we noticed this exception:

2018-12-19 08:50:10,497 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Implementation error: Unhandled exception.

java.util.concurrent.CancellationException

    at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2263)

    at org.apache.flink.runtime.rest.handler.legacy.ExecutionGraphCache.getExecutionGraph(ExecutionGraphCache.java:124)

    at org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.handleRequest(AbstractExecutionGraphHandler.java:76)

    at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:78)

    at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:154)

    at org.apache.flink.runtime.rest.handler.RedirectHandler.lambda$null$0(RedirectHandler.java:142)

    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)

    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)

    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)

    at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)

    at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

    at java.lang.Thread.run(Thread.java:748)

2018-12-19 08:50:17,977 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Implementation error: Unhandled exception.

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-760166654]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".

    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)

    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)

    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)

    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)

    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)

    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)

    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)

    at java.lang.Thread.run(Thread.java:748)

For this we tested with the parameter -Dakka.ask.timeout=60s, but the issue remains.

Thank you

Juan

Oleksandr Nitavskyi

Re: 1.6 UI issues

Hi again here,

I have created two Jira issues: https://issues.apache.org/jira/browse/FLINK-11394 about the UI problem and https://issues.apache.org/jira/browse/FLINK-11396 about the GC pressure in MetricStore. Let's continue the technical discussion there.

A workaround for the GC pressure can be to use a GC that is more predictable than G1 with ergonomics. We have switched to Parallel GC for the JM and hope it will be good enough for all our use cases. On the TM side we still prefer G1 because of the latency promises it makes.

Cheers

Oleksandr
