When I deploy a Flink 1.7 job to Kubernetes, the job itself runs, but the Flink UI shows no metrics and the jobmanager's log contains WARN messages:

[flink-metrics-14] WARN akka.remote.ReliableDeliverySupervisor flink-metrics-akka.remote.default-remote-dispatcher-3 - Association with remote system [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491]] Caused by: [adhoc-historical-taskmanager-d4b65dfd4-h5nrx: Name or service not known]

Note: adhoc-historical-taskmanager-d4b65dfd4-h5nrx is the hostname of the pod on which the taskmanager is running.

So the jobmanager tries to resolve the taskmanager's hostname (which it probably received from the taskmanager itself) and connect to it on a random port. How can this be mitigated?

This is a known issue, see https://issues.apache.org/jira/browse/FLINK-11127. I'm not aware of a workaround.

On 12.12.2018 14:07, Sergei Poganshev wrote:
> When I to deploy Flink 1.7 job to Kubernetes, the job itself runs, but
> upon visiting Flink UI I can see no metrics and there are WARN
> messages in jobmanager's log:
>
> [flink-metrics-14] WARN akka.remote.ReliableDeliverySupervisor
> flink-metrics-akka.remote.default-remote-dispatcher-3 - Association
> with remote system
> [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491]
> has failed, address is now gated for [50] ms. Reason: [Association
> failed with
> [akka.tcp://flink-metrics@adhoc-historical-taskmanager-d4b65dfd4-h5nrx:44491]]
> Caused by: [adhoc-historical-taskmanager-d4b65dfd4-h5nrx: Name or
> service not known]
>
> Note: adhoc-historical-taskmanager-d4b65dfd4-h5nrx is a hostname of a
> pod on which taskmanager is running.
>
> So, jobmanager tries to resolve taskmanager's hostname (which probably
> got to it from taskmanager itself) on a random port. How can this be
> mitigated?

I dealt with this issue by making the taskmanagers a statefulset. By itself, this doesn't solve the issue, because the taskmanager's `hostname` is not a resolvable FQDN on its own; you need to append the rest of the FQDN for the statefulset's "serviceName" to make it resolvable. I handle this by passing the fully qualified serviceName in as an environment variable and using it to overwrite taskmanager.host in flink-conf.yaml in the container's entrypoint script. It's a kludge, but it works.

Using statefulsets brings along a lot of "baggage" that may be overkill for taskmanagers. However, it does have an unrelated benefit for jobs with large state: you can attach dedicated disks in the form of PVCs rather than using up the host's root disk.
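
For anyone wanting to try the same approach, here is a minimal sketch of the config rewrite step, written in Python rather than as the shell entrypoint described above. It is not the actual script: the environment variable name TASKMANAGER_SERVICE_FQDN and the config path /opt/flink/conf/flink-conf.yaml are assumptions for illustration, and it assumes the statefulset's service is headless so that `<pod hostname>.<service FQDN>` resolves.

```python
import os
import socket


def set_taskmanager_host(conf_text: str, pod_hostname: str, service_fqdn: str) -> str:
    """Return conf_text with taskmanager.host set to <pod_hostname>.<service_fqdn>."""
    fqdn = f"{pod_hostname}.{service_fqdn}"
    # Drop any existing taskmanager.host entry so the key appears only once.
    kept = [
        line
        for line in conf_text.splitlines()
        if not line.strip().startswith("taskmanager.host:")
    ]
    kept.append(f"taskmanager.host: {fqdn}")
    return "\n".join(kept) + "\n"


def main() -> None:
    # Hypothetical names: neither the variable nor the path comes from Flink itself.
    service_fqdn = os.environ["TASKMANAGER_SERVICE_FQDN"]
    conf_path = os.environ.get("FLINK_CONF_PATH", "/opt/flink/conf/flink-conf.yaml")

    with open(conf_path, encoding="utf-8") as f:
        conf_text = f.read()

    # In a statefulset pod the hostname is the pod name, e.g. "taskmanager-0".
    updated = set_taskmanager_host(conf_text, socket.gethostname(), service_fqdn)

    with open(conf_path, "w", encoding="utf-8") as f:
        f.write(updated)
```

The entrypoint would call `main()` before launching the taskmanager process. With a service FQDN of the form `<serviceName>.<namespace>.svc.cluster.local`, the resulting value looks like `taskmanager-0.<serviceName>.<namespace>.svc.cluster.local`, which the jobmanager should be able to resolve through cluster DNS.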

On 12/12/18 8:20 AM, Chesnay Schepler wrote:
> This is a known issue, see https://issues.apache.org/jira/browse/FLINK-11127.