I deploy a standalone cluster with Flink 1.5, and when I try to restart
the only jobmanager, below is the log printed by the task manager:

2018-06-04 12:06:35,882 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ipaddress:6123] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2018-06-04 12:07:17,580 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 6af9bbb514a6ddaeca95d6e52db6dbd5 timed out.
2018-06-04 12:07:17,580 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 6af9bbb514a6ddaeca95d6e52db6dbd5.
2018-06-04 12:07:17,611 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /ipaddress:6123
2018-06-04 12:07:17,611 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ipaddress:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@ipaddress:6123]] Caused by: [Connection refused: /ipaddress:6123]

So I'd like to know: is there a config to ask the task manager to keep retrying the connection to the job manager (since I am restarting the jobmanager, it will come back later)?
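For context on the "heartbeat ... timed out" lines above: the heartbeat between the TaskExecutor and the ResourceManager is governed by two flink-conf.yaml settings. A minimal tuning sketch, assuming the documented Flink 1.5 defaults (verify keys and defaults against your version's configuration page):

    # flink-conf.yaml -- heartbeat tuning, values in milliseconds
    # (sketch based on the documented Flink 1.5 defaults)
    heartbeat.interval: 10000   # how often heartbeat requests are sent
    heartbeat.timeout: 50000    # how long without a response before the peer is declared lost

Raising heartbeat.timeout gives a restarting jobmanager more time to come back before the taskmanager closes the ResourceManager connection.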
When I debug the jobmanager, below is the error log in the task manager:
2018-06-04 17:16:33,295 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
2018-06-04 17:16:33,297 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 35df0455efc2fb6fa3f2467f7f5d2ba1.
java.util.concurrent.TimeoutException: The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1553)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener$$Lambda$26/1975100911.run(Unknown Source)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:295)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$12/1732386307.apply(Unknown Source)
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
So is there a way or config to ask the taskmanager to keep trying to connect
to the jobmanager?
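On the retry question: the taskmanager does retry its registration with backoff, for a bounded duration. A sketch of the relevant flink-conf.yaml keys as documented for Flink 1.5 (later releases renamed them to taskmanager.registration.*, so double-check the docs for your exact version):

    # flink-conf.yaml -- TaskManager registration retry behaviour
    # (key names as in the Flink 1.5 docs; treat this as a sketch)
    taskmanager.maxRegistrationDuration: Inf        # keep retrying instead of shutting down
    taskmanager.initial-registration-pause: 500 ms  # first backoff between attempts
    taskmanager.max-registration-pause: 30 s        # exponential backoff is capped here
    taskmanager.refused-registration-pause: 10 s    # pause after an explicitly refused registration

With maxRegistrationDuration at Inf, the taskmanager should keep attempting to reconnect until the jobmanager is back.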
Can anybody share any thoughts or insights about this issue?
Please look into high-availability
to make your cluster resilient against jobmanager shutdowns (a minimal
config sketch follows below).
On 05.06.2018 12:31, makeyang wrote:
> can anybody share any thoughts, insights about this issue?
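A minimal ZooKeeper-based HA setup in flink-conf.yaml looks roughly like the following; the hostnames, ports, and paths are placeholders:

    # flink-conf.yaml -- ZooKeeper high availability (hypothetical hosts/paths)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    high-availability.storageDir: hdfs:///flink/ha/    # JobManager metadata is persisted here
    high-availability.zookeeper.path.root: /flink      # root znode shared by all Flink clusters
    high-availability.cluster-id: /my-cluster          # znode for this particular cluster

With HA enabled, taskmanagers discover the current leader through ZooKeeper rather than a fixed jobmanager address, so they reconnect once a restarted or standby jobmanager regains leadership.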
Hello Chesnay, I have used an HA setup without the masters file and have seen failover happen based on alerts from a leader-election routine. Is a masters file actually required when there is a central arbiter (ZK) that knows the alive JMs, with a callback to force TMs to switch to a new leader in case of failure? On Tue, Jun 5, 2018, 6:45 AM Chesnay Schepler <[hidden email]> wrote:
Though I must admit that the jobs restart, they do restart successfully with the new JM. On Fri, Jul 6, 2018, 8:08 AM Vishal Santoshi <[hidden email]> wrote:
If I remember correctly, the masters file is only used by the
[start|stop]-cluster.sh scripts to determine how many JobManagers
should be started/stopped and which port they should use (see the
sketch below).
It's not necessarily required, but without it you have to manually start/stop all jobmanagers. On 06.07.2018 14:08, Vishal Santoshi wrote:
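For reference, conf/masters simply lists one JobManager per line as host:port (the port being, if I read the scripts right, the web UI port in an HA setup); a hypothetical two-JobManager example:

    # conf/masters (hypothetical hosts)
    jm-host-1:8081
    jm-host-2:8081

start-cluster.sh reads this file and starts a jobmanager on each listed host.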
Yep, perfect, that is what we do. Can you confirm, though, that jobs will restart in the case of a failover? That is what we see, and that is fine. On Fri, Jul 6, 2018, 8:24 AM Chesnay Schepler <[hidden email]> wrote: