is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?


makeyang
I deployed a standalone cluster with Flink 1.5. When I restart the only
JobManager, the TaskManager prints the following log:
2018-06-04 12:06:35,882 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ipaddress:6123] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2018-06-04 12:07:17,580 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 6af9bbb514a6ddaeca95d6e52db6dbd5 timed out.
2018-06-04 12:07:17,580 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 6af9bbb514a6ddaeca95d6e52db6dbd5.
2018-06-04 12:07:17,611 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /ipaddress:6123
2018-06-04 12:07:17,611 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ipaddress:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@ipaddress:6123]] Caused by: [Connection refused: /ipaddress:6123]

So I'd like to know: is there a config option that makes the TaskManager keep
retrying to connect to the JobManager? I am restarting the JobManager, so it
will come back later.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

makeyang
When I debug the JobManager, the TaskManager logs the following error:
2018-06-04 17:16:33,295 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
2018-06-04 17:16:33,297 DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 35df0455efc2fb6fa3f2467f7f5d2ba1.
java.util.concurrent.TimeoutException: The heartbeat of ResourceManager with id 35df0455efc2fb6fa3f2467f7f5d2ba1 timed out.
        at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1553)
        at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener$$Lambda$26/1975100911.run(Unknown Source)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:295)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:150)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$12/1732386307.apply(Unknown Source)
        at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

makeyang
So is there a way, or a config option, to make the TaskManager keep trying to
connect to the JobManager?




Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

makeyang
Can anybody share any thoughts or insights about this issue?




Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Chesnay Schepler
Please look into high-availability to make your cluster resistant against shutdowns.
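
For reference, a minimal sketch of what a ZooKeeper-based high-availability setup might look like in flink-conf.yaml. The configuration keys are Flink's HA options; the host names, ports, storage path, and cluster id are placeholders and not values from this thread:

```yaml
# flink-conf.yaml (sketch only; all values below are assumed placeholders)

# Enable ZooKeeper-based high availability.
high-availability: zookeeper

# ZooKeeper quorum used for leader election and leader retrieval.
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181

# Durable storage for JobManager metadata needed to recover jobs.
high-availability.storageDir: hdfs:///flink/ha/

# Root znode for Flink and a per-cluster id beneath it.
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /my-standalone-cluster
```

With a setup along these lines, TaskManagers look up the current leader JobManager through ZooKeeper instead of relying on a fixed address.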

On 05.06.2018 12:31, makeyang wrote:
Can anybody share any thoughts or insights about this issue?




Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Vishal Santoshi
Hello Chesnay, I have used an HA setup without the masters file and have seen failover happen based on alerts from a leader election routine. Is a masters file actually required when there is a central arbiter (ZK) that tracks the alive JMs and provides a callback to force TMs to switch to a new leader in case of failure?

On Tue, Jun 5, 2018, 6:45 AM Chesnay Schepler wrote:
Please look into high-availability to make your cluster resistant against shutdowns.


Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Vishal Santoshi
I must admit, though, that the jobs do restart, but they restart successfully with the new JM.

On Fri, Jul 6, 2018, 8:08 AM Vishal Santoshi wrote:
Hello Chesnay, I have used an HA setup without the masters file and have seen failover happen based on alerts from a leader election routine. Is a masters file actually required when there is a central arbiter (ZK) that tracks the alive JMs and provides a callback to force TMs to switch to a new leader in case of failure?


Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Chesnay Schepler
In reply to this post by Vishal Santoshi
If I remember correctly, the masters file is only used by the [start|stop]-cluster.sh scripts to determine how many JobManagers should be started or stopped and which port they should use.

It's not strictly required, but without it you have to start and stop all JobManagers manually.

On 06.07.2018 14:08, Vishal Santoshi wrote:
Hello Chesnay, I have used an HA setup without the masters file and have seen failover happen based on alerts from a leader election routine. Is a masters file actually required when there is a central arbiter (ZK) that tracks the alive JMs and provides a callback to force TMs to switch to a new leader in case of failure?


Re: is there a config to ask taskmanager to keep retrying connect to jobmanager after Disassociated?

Vishal Santoshi
Yep, perfect, that is what we do. Can you confirm, though, that jobs will restart in the case of a failover? That is what we see, and that is fine.

On Fri, Jul 6, 2018, 8:24 AM Chesnay Schepler wrote:
If I remember correctly, the masters file is only used by the [start|stop]-cluster.sh scripts to determine how many JobManagers should be started or stopped and which port they should use.

It's not strictly required, but without it you have to start and stop all JobManagers manually.
