Hi,

I'm running Flink jobs on Kubernetes. After a day or so, the task manager and job manager lose their connection and I have to restart everything. I'm assuming that one of the pods crashed and that when the new pod starts it can't find the job manager? I also saw that this is an Akka issue and that it will be fixed in version 1.5.

How can I safely deploy jobs on Kubernetes?

Task manager logs:

2018-03-06 07:23:18,186 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1594, timeout: 30000 milliseconds)

Job manager logs:

Hi Miki,
I'm no expert on the Kubernetes part, but could that be related to https://github.com/kubernetes/kubernetes/issues/6667?

I'm not sure this is an Akka issue: if Akka cannot communicate with some address, it basically blocks that address from further connection attempts for a given time (here 5 seconds). Is there some firewall or port configuration blocking the connection between the JobManager and the (new) TaskManager?

I tried to reproduce it locally with minikube, but starting the jobmanager and taskmanager services as described in [1] and then deleting the task managers and restarting them worked without a flaw.

My bet is on something Flink-external because of the "Temporary failure in name resolution" error message. Maybe @Patrick (cc'd) has encountered this before and knows more.

Nico

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/kubernetes.html

On 06/03/18 11:35, miki haiat wrote:
> Hi,
>
> I'm running Flink jobs on Kubernetes. After a day or so, the task
> manager and job manager lose their connection and I have to restart
> everything.
> I'm assuming that one of the pods crashed and that when the new pod
> starts it can't find the job manager?
> I also saw that this is an Akka issue and that it will be fixed in
> version 1.5.
>
> How can I safely deploy jobs on Kubernetes?
>
> Task manager logs
>
> 2018-03-06 07:23:18,186 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1594, timeout: 30000 milliseconds)
> 2018-03-06 07:23:48,196 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1595, timeout: 30000 milliseconds)
> 2018-03-06 07:24:18,216 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1596, timeout: 30000 milliseconds)
> 2018-03-06 07:24:48,237 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@flink-jobmanager:6123/user/jobmanager (attempt 1597, timeout: 30000 milliseconds)
> 2018-03-06 07:24:53,042 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-jobmanager:6123] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>
> Job manager logs
>
> 2018-03-06 07:25:18,262 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at flink-taskmanager-3509325052-bqtkd (akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073/user/taskmanager) as c37614c28df29d34b80676488e386da3. Current number of registered hosts is 2. Current number of alive task slots is 16.
> 2018-03-06 07:25:18,263 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
> 2018-03-06 07:25:23,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
> 2018-03-06 07:25:28,303 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
> 2018-03-06 07:25:33,322 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
> 2018-03-06 07:25:38,343 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
> 2018-03-06 07:25:43,362 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
> 2018-03-06 07:25:48,383 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
> 2018-03-06 07:25:53,402 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
> 2018-03-06 07:25:58,423 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
> 2018-03-06 07:26:03,442 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
> 2018-03-06 07:26:08,463 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
> 2018-03-06 07:26:13,482 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd]
> 2018-03-06 07:26:18,504 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-3509325052-bqtkd:35073]] Caused by: [flink-taskmanager-3509325052-bqtkd: Temporary failure in name resolution]
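
One way to tell the two suspects in the reply apart (name resolution versus a blocked port) is a small probe run from inside one of the pods. The following is only a minimal sketch, assuming Python 3 is available in the container, which the thread does not state. `check_endpoint` is a hypothetical helper written for illustration and is not part of Flink or Kubernetes; the host name and port used in the note below are the ones that appear in the logs.

```python
import socket


def check_endpoint(host, port, timeout=5.0):
    """Probe host:port and return (addresses, reachable, error).

    addresses: sorted list of IPs the name resolved to (empty on DNS failure)
    reachable: True if a TCP connection could be opened
    error:     None on success, otherwise a short description of the failure
    """
    # Step 1: name resolution. A failure here corresponds to the
    # "Temporary failure in name resolution" seen in the job manager log.
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return [], False, f"name resolution failed: {exc}"

    addresses = sorted({info[4][0] for info in infos})

    # Step 2: TCP connect. A failure here, with DNS working, points more
    # towards a firewall, port, or service configuration problem.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True, None
    except OSError as exc:
        return addresses, False, f"TCP connect failed: {exc}"
```

From a TaskManager pod, `check_endpoint("flink-jobmanager", 6123)` would probe the address the TaskManager keeps trying to register at. From the JobManager pod, the same call with the TaskManager pod name and port from the log (`flink-taskmanager-3509325052-bqtkd`, `35073`) would probe the reverse direction. An empty address list suggests a DNS problem, while a resolved address with a failed connect suggests a port or firewall problem.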