Flink 1.7 jobmanager tries to lookup taskmanager by its hostname in k8s environment


Flink 1.7 jobmanager tries to lookup taskmanager by its hostname in k8s environment

spoganshev
When I deploy a Flink 1.7 job to Kubernetes, the job itself runs, but the Flink UI shows no metrics and there are WARN messages in the jobmanager's log:

[flink-metrics-14] WARN akka.remote.ReliableDeliverySupervisor flink-metrics-akka.remote.default-remote-dispatcher-3 - Association with remote system [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491]] Caused by: [adhoc-historical-taskmanager-d4b65dfd4-h5nrx: Name or service not known]

Note: adhoc-historical-taskmanager-d4b65dfd4-h5nrx is the hostname of the pod on which the taskmanager is running.

So the jobmanager tries to reach the taskmanager by its pod hostname (which it probably received from the taskmanager itself) on a random port, and that hostname does not resolve from the jobmanager pod. How can this be mitigated?



Re: Flink 1.7 jobmanager tries to lookup taskmanager by its hostname in k8s environment

Chesnay Schepler
This is a known issue, see
https://issues.apache.org/jira/browse/FLINK-11127.

I'm not aware of a workaround.

On 12.12.2018 14:07, Sergei Poganshev wrote:

> When I deploy a Flink 1.7 job to Kubernetes, the job itself runs, but
> upon visiting Flink UI I can see no metrics and there are WARN
> messages in jobmanager's log:
>
> [flink-metrics-14] WARN akka.remote.ReliableDeliverySupervisor
> flink-metrics-akka.remote.default-remote-dispatcher-3 - Association
> with remote system
> [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491]
> has failed, address is now gated for [50] ms. Reason: [Association
> failed with
> [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491]]
> Caused by: [adhoc-historical-taskmanager-d4b65dfd4-h5nrx: Name or
> service not known]
>
> Note: adhoc-historical-taskmanager-d4b65dfd4-h5nrx is a hostname of a
> pod on which taskmanager is running.
>
> So, jobmanager tries to resolve taskmanager's hostname (which probably
> got to it from taskmanager itself) on a random port. How can this be
> mitigated?
>
>


Re: Flink 1.7 jobmanager tries to lookup taskmanager by its hostname in k8s environment

Derek VerLee

I dealt with this issue by making the taskmanagers a StatefulSet.

By itself, this doesn't solve the issue, because the taskmanager's `hostname` is not a resolvable FQDN on its own; you need to append the rest of the FQDN for the StatefulSet's "serviceName" to make it resolvable. I handle this by passing the fully qualified serviceName in as an environment variable and using it to overwrite taskmanager.host in flink-conf.yaml in the container's entrypoint script.

It's a kludge, but it works. Using statefulsets brings along a lot of "baggage" that may be overkill for taskmanagers.  However it does have an unrelated benefit for jobs with large state, in that you can attach dedicated disks in the form of PVCs, rather than using up the host's root disk.
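
For illustration only, the entrypoint step described above might look roughly like the sketch below. This is not the script from the post: the environment variable name `TASKMANAGER_SERVICE_FQDN`, the default config path, and the helper names are all assumptions made for the example. It joins the pod hostname with the service's domain and then replaces (or appends) the `taskmanager.host` line in the config text.

```python
import os
import re
import socket


def resolvable_host(pod_hostname: str, service_fqdn: str) -> str:
    """Join the pod hostname with the (headless) service's domain name."""
    return f"{pod_hostname}.{service_fqdn.strip('.')}"


def set_taskmanager_host(conf_text: str, host: str) -> str:
    """Return conf_text with taskmanager.host set to host.

    An existing taskmanager.host line is replaced; otherwise one is appended.
    """
    line = f"taskmanager.host: {host}"
    pattern = re.compile(r"^taskmanager\.host:.*$", re.MULTILINE)
    if pattern.search(conf_text):
        # A callable replacement avoids any escape processing of the host string.
        return pattern.sub(lambda _match: line, conf_text)
    sep = "" if (not conf_text or conf_text.endswith("\n")) else "\n"
    return f"{conf_text}{sep}{line}\n"


def main() -> None:
    # Hypothetical names: adjust the variable and the path to your image.
    conf_path = os.environ.get("FLINK_CONF_FILE", "/opt/flink/conf/flink-conf.yaml")
    service_fqdn = os.environ["TASKMANAGER_SERVICE_FQDN"]
    host = resolvable_host(socket.gethostname(), service_fqdn)
    with open(conf_path, encoding="utf-8") as f:
        conf_text = f.read()
    with open(conf_path, "w", encoding="utf-8") as f:
        f.write(set_taskmanager_host(conf_text, host))
```

Calling `main()` before the taskmanager process starts would apply the change. For example, with a pod hostname of `tm-0` and a service domain of `flink-tm.default.svc.cluster.local`, the helper produces `tm-0.flink-tm.default.svc.cluster.local`.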


On 12/12/18 8:20 AM, Chesnay Schepler wrote:
This is a known issue, see https://issues.apache.org/jira/browse/FLINK-11127.

I'm not aware of a workaround.
