akka.remote.ReliableDeliverySupervisor Temporary failure in name resolution

miki haiat
Hi,

I'm running Flink jobs on Kubernetes. After a day or so, the task manager and job manager lose their connection and I have to restart everything.
I assume that one of the pods crashed and that the new pod cannot find the job manager when it starts?
I also read that this is an Akka issue and that it will be fixed in version 1.5.

How can I safely deploy jobs on Kubernetes?


task manager logs 
2018-03-06 07:23:18,186 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1594, timeout: 30000 milliseconds)
2018-03-06 07:23:48,196 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1595, timeout: 30000 milliseconds)
2018-03-06 07:24:18,216 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1596, timeout: 30000 milliseconds)
2018-03-06 07:24:48,237 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1597, timeout: 30000 milliseconds)
2018-03-06 07:24:53,042 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-jobmanager:6123] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 

Job manager logs 

2018-03-06 07:25:18,262 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at flink-taskmanager-3509325052-bqtkd (akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073/user/taskmanager) as c37614c28df29d34b80676488e386da3. Current number of registered hosts is 2. Current number of alive task slots is 16.
2018-03-06 07:25:18,263 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
2018-03-06 07:25:23,282 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
2018-03-06 07:25:28,303 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
2018-03-06 07:25:33,322 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
2018-03-06 07:25:38,343 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
2018-03-06 07:25:43,362 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
2018-03-06 07:25:48,383 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
2018-03-06 07:25:53,402 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
2018-03-06 07:25:58,423 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
2018-03-06 07:26:03,442 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
2018-03-06 07:26:08,463 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
2018-03-06 07:26:13,482 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
2018-03-06 07:26:18,504 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
 

Re: akka.remote.ReliableDeliverySupervisor Temporary failure in name resolution

Nico Kruber
Hi Miki,
I'm no expert on the Kubernetes part, but could that be related to
https://github.com/kubernetes/kubernetes/issues/6667?

I'm not sure this is an Akka issue: if Akka cannot communicate with an
address, it gates that address, i.e. it blocks further connection
attempts to it for a given time (here 5 seconds).
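
To illustrate what "gated for [5000] ms" means in the logs, here is a
minimal Python sketch of the idea. It is a simplified model and not
Akka's actual implementation: the class name AddressGate and its methods
are made up for this example, and only the 5-second window is taken from
the log lines.

```python
import time


class AddressGate:
    """Simplified model of gating: after a failure, an address is
    blocked from further connection attempts for gate_seconds.
    Hypothetical helper, not part of Akka or Flink."""

    def __init__(self, gate_seconds=5.0, clock=time.monotonic):
        self.gate_seconds = gate_seconds
        self._clock = clock          # injectable so the gate can be tested
        self._gated_until = {}       # address -> time at which the gate opens

    def record_failure(self, address):
        # A failed association closes the gate for the configured window.
        self._gated_until[address] = self._clock() + self.gate_seconds

    def is_gated(self, address):
        until = self._gated_until.get(address)
        if until is None:
            return False
        if self._clock() >= until:
            # The window has elapsed, so new attempts are allowed again.
            del self._gated_until[address]
            return False
        return True
```

In this model every failed attempt simply re-arms the gate, which matches
the pattern of one WARN line roughly every 5 seconds in the JobManager
log.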

Is there some firewall or port configuration blocking the connection
between the JobManager and the (new) TaskManager?
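
One way to check this from inside a pod is a plain TCP connection test.
The following is only a sketch: it assumes Python is available in the
container (it may not be in your image), and the function name
can_connect is made up for this example. The host and port would be the
ones from your logs, e.g. flink-jobmanager and 6123.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try to open a TCP connection and report what went wrong, if anything.
    Returns (True, "connected") or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.gaierror as err:
        # The host name could not be resolved at all (DNS problem).
        return False, f"name resolution failed: {err}"
    except OSError as err:
        # Resolved, but refused, timed out, or otherwise unreachable.
        return False, f"connection failed: {err}"
```

Calling can_connect("flink-jobmanager", 6123) from a TaskManager pod
should distinguish a DNS problem from a blocked or closed port.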


I tried to reproduce it locally with minikube, but starting jobmanager
and taskmanager services as described in [1] and then deleting the task
managers and re-starting them again worked without a flaw. My bet is on
something Flink-external because of the "Temporary failure in name
resolution" error message.
Maybe @Patrick (cc'd) has encountered this before and knows more.
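
To narrow down the name resolution part, a small check like the one
below could be run from inside the JobManager pod. Again this is just a
sketch under assumptions: Python has to be present in the container, and
check_resolution is a made-up helper name. As far as I know, "Temporary
failure in name resolution" is the resolver's message for the EAI_AGAIN
error, which the sketch reports separately.

```python
import socket


def check_resolution(hostname):
    """Resolve hostname and return (True, [addresses]) on success,
    or (False, description) if the resolver reports an error."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror as err:
        # EAI_AGAIN marks a temporary resolver failure; anything else
        # (for example an unknown host) is reported as "other".
        temporary = err.errno == getattr(socket, "EAI_AGAIN", None)
        kind = "temporary" if temporary else "other"
        return False, f"{kind} failure: {err}"
    # Each entry ends with a sockaddr tuple whose first item is the address.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses
```

For example, check_resolution("flink-taskmanager-3509325052-bqtkd")
would show whether the pod host name from the log resolves at that
moment.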



Nico


[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/kubernetes.html

On 06/03/18 11:35, miki haiat wrote:

> Hi,
>
> I'm running Flink jobs on Kubernetes. After a day or so, the task
> manager and job manager lose their connection and I have to restart
> everything.
> I assume that one of the pods crashed and that the new pod cannot find
> the job manager when it starts?
> I also read that this is an Akka issue and that it will be fixed in
> version 1.5.
>
> How can I safely deploy jobs on Kubernetes?

