(DEPRECATED) Apache Flink User Mailing List archive.

Re: Standalone cluster instability

Posted by Alexander Smirnov on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Standalone-cluster-instability-tp19091p19166.html

Hi Piotr,

I didn't find anything special in the logs before the failure.

Here are the logs, please take a look:

https://drive.google.com/drive/folders/1zlUDMpbO9xZjjJzf28lUX-bkn_x7QV59?usp=sharing

The configuration is:

3 task managers:
  qafdsflinkw011.scl
  qafdsflinkw012.scl
  qafdsflinkw013.scl - lost connection

3 job managers:
  qafdsflinkm011.scl - the leader
  qafdsflinkm012.scl
  qafdsflinkm013.scl

3 zookeepers:
  qafdsflinkzk011.scl
  qafdsflinkzk012.scl
  qafdsflinkzk013.scl

Thank you,

Alex

On Wed, Mar 21, 2018 at 6:23 PM Piotr Nowojski <[hidden email]> wrote:

Hi,

Does the issue really happen after 48 hours?
Is there some indication of a failure in TaskManager log?

If you are still unable to solve the problem, please provide the full TaskManager and JobManager logs.

Piotrek

On 21 Mar 2018, at 16:00, Alexander Smirnov <[hidden email]> wrote:

One more question: I see a lot of lines like the following in the logs:

[2018-03-21 00:30:35,975] ERROR Association to [akka.tcp://flink@...:35320] with UID [1500204560] irrecoverably failed. Quarantining address. (akka.remote.Remoting)
[2018-03-21 00:34:15,208] WARN Association to [akka.tcp://flink@...:41068] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,235] WARN Association to [akka.tcp://flink@...:40677] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,256] WARN Association to [akka.tcp://flink@...:40382] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,256] WARN Association to [akka.tcp://flink@...:44744] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms. (akka.remote.Remoting)
[2018-03-21 00:34:15,266] WARN Association to [akka.tcp://flink@...:42413] with unknown UID is irrecoverably failed. Address cannot be quarantined without knowing the UID, gating instead for 5000 ms. (akka.remote.Remoting)
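To see which remote ports keep failing and how often, log lines like the ones above can be grouped by port and outcome. The following is a minimal sketch, not a Flink or Akka tool: the regular expression is an assumption derived only from the two message shapes quoted here, and `summarize` and `SAMPLE` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Assumed pattern, based only on the two akka.remote.Remoting messages quoted above.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+Association to "
    r"\[akka\.tcp://flink@(?P<host>[^:\]]+):(?P<port>\d+)\]\s+with "
    r"(?P<uid>UID \[\d+\]|unknown UID)"
)


def summarize(lines):
    """Count association failures per (port, outcome) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not an association-failure line
        # A known UID is quarantined; an unknown UID is only gated.
        outcome = "quarantined" if "Quarantining address" in line else "gated"
        counts[(int(match.group("port")), outcome)] += 1
    return counts


# Three of the lines quoted above, with the host elided exactly as in the logs.
SAMPLE = [
    "[2018-03-21 00:30:35,975] ERROR Association to [akka.tcp://flink@...:35320] "
    "with UID [1500204560] irrecoverably failed. Quarantining address. "
    "(akka.remote.Remoting)",
    "[2018-03-21 00:34:15,208] WARN Association to [akka.tcp://flink@...:41068] "
    "with unknown UID is irrecoverably failed. Address cannot be quarantined "
    "without knowing the UID, gating instead for 5000 ms. (akka.remote.Remoting)",
    "[2018-03-21 00:34:15,266] WARN Association to [akka.tcp://flink@...:42413] "
    "with unknown UID is irrecoverably failed. Address cannot be quarantined "
    "without knowing the UID, gating instead for 5000 ms. (akka.remote.Remoting)",
]

for (port, outcome), count in sorted(summarize(SAMPLE).items()):
    print(port, outcome, count)
```

Running it over a whole JobManager log (for example `summarize(open(path))`, where `path` is whatever log file is at hand) would show whether the same handful of ports repeats or whether new ports keep appearing.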

The host is available, but I don't understand where the port numbers come from. The Task Manager uses a different port (which is printed in the logs on startup).
Could you please help me understand why this happens?
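One way to separate "the host is up" from "something is listening on that port" is a plain TCP connect test. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, and the host and port in the usage comment are just an illustrative pairing, since the logs elide the host.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


# Example call (assumed pairing of a host from this thread with a port from the logs):
#   port_is_open("qafdsflinkw013.scl", 42413)
```

If this returns False for a port from the warnings while the port printed at Task Manager startup returns True, the warnings would refer to an address that nothing is currently listening on.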

Thank you,
Alex

On Wed, Mar 21, 2018 at 4:19 PM Alexander Smirnov <[hidden email]> wrote:
Hello,

I've assembled a standalone cluster of 3 task managers and 3 job managers (and 3 ZK) following the instructions at:

https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/cluster_setup.html and https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/jobmanager_high_availability.html

It works OK, but task managers randomly become unavailable. The JobManager has exceptions like the ones below in its logs:

[2018-03-19 00:33:10,211] WARN Association with remote system [akka.tcp://flink@...:42413] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@...:42413]] Caused by: [Connection refused: qafdsflinkw811.nn.five9lab.com/10.5.61.124:42413] (akka.remote.ReliableDeliverySupervisor)
[2018-03-21 00:30:35,975] ERROR Association to [akka.tcp://flink@...:35320] with UID [1500204560] irrecoverably failed. Quarantining address. (akka.remote.Remoting)
java.util.concurrent.TimeoutException: Remote system has been silent for too long. (more than 48.0 hours)
    at akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:375)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I can't find a reason for this exception, any ideas?
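For reference, the gap between the two timestamps quoted above can be worked out directly. This is a small sketch using only the standard library; the timestamp format string is an assumption read off the log lines, and the two entries refer to different ports (42413 and 35320), so the result is only suggestive and does not show that they concern the same endpoint.

```python
from datetime import datetime

# Assumed format of the bracketed log timestamps, e.g. "2018-03-19 00:33:10,211".
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(text):
    """Parse a log timestamp with comma-separated milliseconds."""
    return datetime.strptime(text, FMT)


gated = parse_ts("2018-03-19 00:33:10,211")        # WARN: address gated
quarantined = parse_ts("2018-03-21 00:30:35,975")  # ERROR: address quarantined

gap = quarantined - gated
hours = gap.total_seconds() / 3600
print(gap)              # 1 day, 23:57:25.764000
print(round(hours, 2))  # 47.96
```

The two entries are just under 48 hours apart, which is close to the "more than 48.0 hours" silence reported in the TimeoutException.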

Thank you,
Alex