(DEPRECATED) Apache Flink User Mailing List archive.

is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

makeyang

is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

I deployed a standalone cluster with Flink 1.5. When I restart the only
JobManager, the TaskManager prints the following log:
2018-06-04 12:06:35,882 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ipaddress:6123] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2018-06-04 12:07:17,580 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 6af9bbb514a6ddaeca95d6e52db6dbd5 timed out.
2018-06-04 12:07:17,580 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 6af9bbb514a6ddaeca95d6e52db6dbd5.
2018-06-04 12:07:17,611 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /ipaddress:6123
2018-06-04 12:07:17,611 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ipaddress:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@ipaddress:6123]] Caused by: [Connection refused: /ipaddress:6123]

So I'd like to know whether there is a config option that makes the
TaskManager keep retrying to connect to the JobManager (I am restarting the
JobManager, so it will come back later).
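
For context, the "heartbeat of ResourceManager ... timed out" lines above are governed by Flink's heartbeat settings in flink-conf.yaml. The fragment below is only a sketch: the values are what I understand to be the defaults, and whether raising them is enough for a TaskManager to ride out a JobManager restart is an assumption that nothing in this thread confirms.

```yaml
# flink-conf.yaml (sketch; values are believed to be the defaults)
heartbeat.interval: 10000   # milliseconds between heartbeat requests
heartbeat.timeout: 50000    # milliseconds without a heartbeat before the peer is considered lost
```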

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

makeyang

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

When I debug the JobManager, the TaskManager logs the following error:
2018-06-04 17:16:33,295 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
2018-06-04 17:16:33,297 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 35df0455efc2fb6fa3f2467f7f5d2ba1.
java.util.concurrent.TimeoutException: The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1553)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener$$Lambda$26/1975100911.run(Unknown Source)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:295)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$12/1732386307.apply(Unknown Source)
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


makeyang

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

So is there a way, or a config option, to make the TaskManager keep trying to
connect to the JobManager?


makeyang

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Can anybody share any thoughts or insights about this issue?


Chesnay Schepler

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Please look into high-availability to make your cluster resistant against shutdowns.
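
For reference, a ZooKeeper-based high-availability setup is configured in flink-conf.yaml roughly as follows. This is a minimal sketch that is not taken from this thread: the ZooKeeper hosts, the storage path, and the cluster id are placeholder assumptions and must be replaced with real values.

```yaml
# flink-conf.yaml (sketch; hostnames, paths, and ids below are placeholders)
high-availability: zookeeper
# ZooKeeper quorum used for leader election and leader retrieval
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Durable storage for the JobManager metadata needed to recover after a failover
high-availability.storageDir: hdfs:///flink/ha/
# Optional: keeps several Flink clusters apart when they share one ZooKeeper ensemble
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /standalone_cluster_one
```

With such a setup, TaskManagers look up the current leading JobManager through ZooKeeper rather than relying on a single fixed address, so a restarted or standby JobManager can take over.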

On 05.06.2018 12:31, makeyang wrote:

Can anybody share any thoughts or insights about this issue?




Vishal Santoshi

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Hello Chesnay, I have used an HA setup without the masters file and have seen failover happen based on alerts from a leader election routine. Is it actually required that there be a masters file when there is a central arbiter (ZK) that tracks the alive JMs and a callback to force TMs to switch to a new leader in case of failure?

On Tue, Jun 5, 2018, 6:45 AM Chesnay Schepler <[hidden email]> wrote:

Please look into high-availability to make your cluster resistant against shutdowns.


Vishal Santoshi

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

I must admit, though, that the jobs do restart, but they restart successfully with the new JM.

On Fri, Jul 6, 2018, 8:08 AM Vishal Santoshi <[hidden email]> wrote:

Hello Chesnay, I have used an HA setup without the masters file and have seen failover happen based on alerts from a leader election routine. Is it actually required that there be a masters file when there is a central arbiter (ZK) that tracks the alive JMs and a callback to force TMs to switch to a new leader in case of failure?

Chesnay Schepler

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

In reply to this post by Vishal Santoshi

If I remember correctly, the masters file is only used by the [start|stop]-cluster.sh scripts to determine how many JobManagers should be started / stopped and which port they should use.

It's not strictly required, but without it you have to start and stop all JobManagers manually.

On 06.07.2018 14:08, Vishal Santoshi wrote:

Hello Chesnay, I have used an HA setup without the masters file and have seen failover happen based on alerts from a leader election routine. Is it actually required that there be a masters file when there is a central arbiter (ZK) that tracks the alive JMs and a callback to force TMs to switch to a new leader in case of failure?

Vishal Santoshi

Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Yep, perfect, that is what we do. Can you confirm, though, that jobs will restart in the case of a failover? That is what we see, and that is fine.

On Fri, Jul 6, 2018, 8:24 AM Chesnay Schepler <[hidden email]> wrote:

If I remember correctly, the masters file is only used by the [start|stop]-cluster.sh scripts to determine how many JobManagers should be started / stopped and which port they should use.

It's not strictly required, but without it you have to start and stop all JobManagers manually.
