Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Martin, Nick-2

I’m running Flink 1.7.2 in a Docker swarm. Intermittently, new task managers fail to resolve their own hostnames when starting up. In the log I see “no hostname could be resolved” messages coming from TaskManagerLocation. The web UI on the jobmanager shows these taskmanagers as associated/connected with the jobmanager, but their Akka paths show their IP address rather than the container name that ‘good’ taskmanagers show. The taskmanagers listed by IP give ‘failed to connect’ errors when new jobs are started that try to use them, and those jobs eventually fail. Yet the affected taskmanagers still send regular heartbeats to the jobmanager, so the jobmanager keeps trying to assign work to them. Does anyone know what’s going on here?
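
For anyone trying to spot the same problem, a rough way to find the affected taskmanagers is to grep the service logs for the message above (the service name below is just a placeholder for whatever the taskmanager service is actually called):

    # Look for the reverse-lookup failure in the taskmanager service logs.
    # "flink_taskmanager" is a placeholder service name.
    docker service logs flink_taskmanager 2>&1 | grep -i "no hostname could be resolved"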


Re: Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Yang Wang
Hi Martin,

Could you `docker exec` into the problematic taskmanager and check whether the hostname can be resolved to the correct IP? You could use `nslookup {tm_hostname}` to verify.
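
Something along these lines should work, assuming the container has `hostname` and `nslookup` available (the name filter is just an example; adjust it to how your taskmanager containers are named):

    # Find the problematic taskmanager container (name filter is an example).
    docker ps --filter name=taskmanager
    # Open a shell in it and check that its own hostname resolves.
    docker exec -it <container_id> sh -c 'nslookup "$(hostname)"'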


Best,
Yang



RE: Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Martin, Nick-2

Yes, the container seems to be resolving its own hostname correctly (the Flink Docker image doesn’t come with nslookup installed, but pinging by hostname worked). When I did the check, it had been a considerable time since the container started, so I can’t rule out a race condition between Flink startup and container hostname assignment.
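
Roughly what the check amounted to inside the container (getent is only an alternative if it happens to be present in the image):

    # Ping the container's own hostname, since nslookup isn't in the image.
    ping -c 1 "$(hostname)"
    # Alternative lookup that avoids nslookup, if getent is available.
    getent hosts "$(hostname)"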

 

Another weird thing I noticed is that the IP being reported by the jobmanager in place of the hostname isn’t the address of an individual container; it’s the virtual IP for the whole taskmanager service. That seems strange, since the hostname that points to the taskmanager service isn’t something I put in Flink’s config files anywhere, and I don’t think the containers should be referring to themselves by that name.
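
One way to confirm that the reported address really is the service VIP rather than a container address (service name is a placeholder, and this assumes the default `vip` endpoint mode for the service):

    # Show the virtual IPs assigned to the taskmanager service.
    docker service inspect --format '{{json .Endpoint.VirtualIPs}}' flink_taskmanager
    # Compare against the container's own address on the overlay network.
    docker inspect --format '{{json .NetworkSettings.Networks}}' <container_id>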

 
