Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Martin, Nick-2

I’m running Flink 1.7.2 in a Docker swarm. Intermittently, new task managers fail to resolve their own hostnames when starting up. In the log I see “no hostname could be resolved” messages coming from TaskManagerLocation. The web UI on the jobmanager shows these taskmanagers as associated/connected with the jobmanager, but their Akka paths show their IP address rather than the container name that ‘good’ taskmanagers show. The taskmanagers listed by IP give ‘failed to connect’ errors when new jobs are started that try to use them, and those jobs eventually fail. Yet the affected taskmanagers still send regular heartbeats to the jobmanager, so the jobmanager keeps trying to assign work to them. Does anyone know what’s going on here?
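
For anyone trying to spot the same problem, a rough way to find the affected taskmanagers is to grep the service logs for the message above (the service name below is just a placeholder for whatever the taskmanager service is actually called):

    # Look for the reverse-lookup failure in the taskmanager service logs.
    # "flink_taskmanager" is a placeholder service name.
    docker service logs flink_taskmanager 2>&1 | grep -i "no hostname could be resolved"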


Re: Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Yang Wang
Hi Martin,

Could you `docker exec` into the problematic taskmanager and check whether the hostname can be resolved to the correct IP? You could use `nslookup {tm_hostname}` to verify.
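
Something along these lines should work, assuming the container has `hostname` and `nslookup` available (the name filter is just an example; adjust it to how your taskmanager containers are named):

    # Find the problematic taskmanager container (name filter is an example).
    docker ps --filter name=taskmanager
    # Open a shell in it and check that its own hostname resolves.
    docker exec -it <container_id> sh -c 'nslookup "$(hostname)"'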


Best,
Yang



RE: Taskmanagers in Docker Fail to Resolve Own Hostnames and Won't Accept Tasks

Martin, Nick-2

Yes, the container seems to be resolving its own hostname correctly (the Flink Docker image doesn’t come with nslookup installed, but pinging by hostname worked). When I did the check, it had been a considerable time since the container started, so I can’t rule out a race condition between Flink startup and container hostname assignment.
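
Roughly what the check amounted to inside the container (getent is only an alternative if it happens to be present in the image):

    # Ping the container's own hostname, since nslookup isn't in the image.
    ping -c 1 "$(hostname)"
    # Alternative lookup that avoids nslookup, if getent is available.
    getent hosts "$(hostname)"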

 

Another weird thing I noticed is that the IP being reported by the jobmanager in place of the hostname isn’t the address of an individual container; it’s the virtual IP for the whole taskmanager service. That seems strange, since the hostname that points to the taskmanager service isn’t something I put in Flink’s config files anywhere, and I don’t think the containers should be referring to themselves by that name.
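
One way to confirm that the reported address really is the service VIP rather than a container address (service name is a placeholder, and this assumes the default `vip` endpoint mode for the service):

    # Show the virtual IPs assigned to the taskmanager service.
    docker service inspect --format '{{json .Endpoint.VirtualIPs}}' flink_taskmanager
    # Compare against the container's own address on the overlay network.
    docker inspect --format '{{json .NetworkSettings.Networks}}' <container_id>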

 
