Hostname for taskmanagers when running in docker

Nikola Hrusov
Hello,

After upgrading the Flink Docker image from 1.9 to 1.11.1, the taskmanagers show up in our metrics under their IPs (e.g. 10.0.23.101) instead of their hostnames.

In the docker compose file we specify the hostname as such:

hostname: "taskmanager-{{.Node.Hostname}}"

Is there another way of achieving this?

Regards,
Nikola Hrusov

Re: Hostname for taskmanagers when running in docker

Xintong Song
Hi Nikola,

I'm not entirely sure how this happened. I would need more information to investigate, such as the complete taskmanager configuration in your docker compose file and the taskmanager logs.

One quick thing you could try is to explicitly set the configuration option `taskmanager.host` for your task managers and see whether that is reflected in the metrics.
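
As a rough sketch of what this could look like in a compose file: the fragment below assumes the official Flink image, which reads extra configuration from the FLINK_PROPERTIES environment variable. The service name `taskmanager` and the image tag are placeholders, not taken from the original setup, and the value given to `taskmanager.host` is assumed to be a name that the jobmanager can resolve (here, the compose service name).

```yaml
# Illustrative fragment only; names and the image tag are assumptions.
services:
  taskmanager:
    image: flink:1.11.1        # assumed tag, adjust to the image actually in use
    command: taskmanager
    environment:
      # Everything after "FLINK_PROPERTIES=" is passed to Flink as configuration.
      # taskmanager.host is set to the compose service name so that it resolves
      # from the jobmanager container.
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
        taskmanager.host: taskmanager
```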

Thank you~

Xintong Song



On Wed, Aug 12, 2020 at 3:06 PM Nikola Hrusov <[hidden email]> wrote:
Hello,

After upgrading the Flink Docker image from 1.9 to 1.11.1, the taskmanagers show up in our metrics under their IPs (e.g. 10.0.23.101) instead of their hostnames.

In the docker compose file we specify the hostname as such:

hostname: "taskmanager-{{.Node.Hostname}}"

Is there another way of achieving this?

Regards,
Nikola Hrusov

Re: Hostname for taskmanagers when running in docker

Nikola Hrusov
Hi Xintong,

I have tried setting taskmanager.host, but that actually makes things worse. To make the problem easier to reproduce and explain, I have put together a simple docker compose setup. You can find the compose files here: https://github.com/nikobearrr/flink-hostname-metrics

I have made two almost identical compose files, each of which starts a Flink jobmanager and taskmanager together with Graphite (for metrics). One is called docker-compose.yml and the other docker-compose_with_hostname.yml.
The only difference between the two is line #29, which sets the taskmanager.host option. Both expose port 8081 for the Flink cluster UI and port 8082 for the Graphite UI.


Running the setup without the taskmanager.host
When you run the compose file without the taskmanager.host option, the cluster starts just fine and the taskmanager registers. Running a job on that cluster works as well.
The issue is that the metrics in the Graphite UI show the IP (in this case 172.20.0.3) instead of the hostname. That was not the case with Flink versions prior to 1.11.

image.png






Running the setup with the taskmanager.host
Once I run the compose file that includes the taskmanager.host option, the cluster UI comes up and the cluster starts fine. The metrics are also reported correctly:

image.png

However, I then found other problems.

The first is that when you open http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics (where 70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager), the following warnings start appearing in my console:

jobmanager_1     | 2020-10-27 15:59:44,366 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
jobmanager_1     | 2020-10-27 15:59:55,236 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
jobmanager_1     | 2020-10-27 16:00:07,219 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]

image.png

The metrics for the taskmanager are also not displayed, as the picture above shows. Without "taskmanager.host", the metrics are displayed and there are no such WARN logs for UnknownHostException.
More importantly, our batch jobs fail on submission, apparently for the same reason. This only happens when we explicitly set the "taskmanager.host" option.

from taskmanager:

2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 with leader id 00000000-0000-0000-0000-000000000000.
2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved JobManager address, beginning registration
2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Registration at JobManager was declined: Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname
2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and re-attempting registration in 30000 ms
2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:1, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId: a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.


from jobmanager:

2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager [hidden email]://flink@jobmanager:6123/user/rpc/jobmanager_3 for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager [hidden email]://flink@jobmanager:6123/user/rpc/jobmanager_3 for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{UNKNOWN} for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id a4dfa7dfa91bbce7cd091aa1512c0745.
2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40] ERROR org.apache.flink.runtime.jobmaster.JobMaster - Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname



Both the taskmanager and jobmanager agree on one thing: "taskmanager-node01: No address associated with hostname".
Setting the hostname explicitly fixes the metrics in Graphite, but then job submission/execution does not work, which is even worse than not having the metrics.

So my question is: is there anything else that needs to be set when using the taskmanager.host option? Or am I perhaps doing something wrong in the setup?

Regards,
Nikola Hrusov


On Fri, Aug 14, 2020 at 6:26 AM Xintong Song <[hidden email]> wrote:
Hi Nikola,

I'm not entirely sure how this happened. I would need more information to investigate, such as the complete taskmanager configuration in your docker compose file and the taskmanager logs.

One quick thing you could try is to explicitly set the configuration option `taskmanager.host` for your task managers and see whether that is reflected in the metrics.

Thank you~

Xintong Song




Re: Hostname for taskmanagers when running in docker

Nikola Hrusov
Hello,

I am still trying to find out how to properly set up a cluster with Flink 1.11 and receive metrics under the hostnames.
As outlined in my previous email, I currently have to choose between a) receiving proper metrics and b) running my jobs. Ideally I should be able to do both, as this is possible with Flink 1.10.

Can somebody shed some light on this matter?

Regards,
Nikola Hrusov


On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov <[hidden email]> wrote:
Hi Xintong,

I have tried setting taskmanager.host, but that actually makes things worse. To make the problem easier to reproduce and explain, I have put together a simple docker compose setup. You can find the compose files here: https://github.com/nikobearrr/flink-hostname-metrics


Re: Hostname for taskmanagers when running in docker

rmetzger0
Hey Nikola,
sorry for the delayed response.

I just tried the docker-compose files you've provided, and "docker-compose -f docker-compose.yml up" works for me -- metrics are shown in the UI, and I'm able to submit jobs via the web UI and the command line client.

I got "docker-compose_with_hostname.yml" to work after the following change:


diff --git a/docker-compose_with_hostname.yml b/docker-compose_with_hostname.yml
index d876cb5..c34073f 100644
--- a/docker-compose_with_hostname.yml
+++ b/docker-compose_with_hostname.yml
@@ -26,7 +26,7 @@ services:
         FLINK_PROPERTIES=
         jobmanager.rpc.address: jobmanager
         taskmanager.numberOfTaskSlots: 2
-        taskmanager.host: "taskmanager-node01"
+        taskmanager.host: "taskmanager01"
         metrics.reporter.grph.factory.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
         metrics.reporter.grph.host: graphite
         metrics.reporter.grph.port: 2003

The name of the docker-compose service is the hostname.

On Tue, Nov 3, 2020 at 5:24 PM Nikola Hrusov <[hidden email]> wrote:
Hello,

I am still trying to find out how to properly set up a cluster with Flink 1.11 and receive metrics under the hostnames.
As outlined in my previous email, I currently have to choose between a) receiving proper metrics and b) running my jobs. Ideally I should be able to do both, as this is possible with Flink 1.10.

Can somebody shed some light on this matter?

Regards,
Nikola Hrusov



Re: Hostname for taskmanagers when running in docker

rmetzger0
I hope it's fine that I moved our discussion back on the list.

You cannot use an arbitrary hostname for the Flink configuration key "taskmanager.host". It must be a valid hostname that can be resolved within the Flink cluster, so that the RPC services can reach each other.
I don't think there's a way to define a custom taskmanager name in the Flink metrics.
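
To illustrate the resolvability requirement: one untested way to make a custom name such as "taskmanager-node01" resolvable from the other containers in a plain docker-compose setup is a network alias. The fragment below is a sketch under that assumption; the service name, the image name and the use of the default network are illustrative and not taken from the original compose files. It does not cover Docker Swarm templates such as {{.Node.Hostname}}, so it may not carry over to a `mode: global` deployment.

```yaml
# Illustrative fragment only; not verified against the setup in this thread.
services:
  taskmanager01:
    image: flinkimage            # placeholder image name
    command: taskmanager
    hostname: taskmanager-node01 # name the container sees as its own hostname
    networks:
      default:
        aliases:
          # Extra DNS name under which other containers on this network
          # (e.g. the jobmanager) can look up this container.
          - taskmanager-node01
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.host: taskmanager-node01
```

The idea is that the name given to `taskmanager.host` and the name the jobmanager can look up are the same string; whether the metrics then carry that name would still need to be checked.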

I'm adding Chesnay to the conversation, since he's very familiar with the metrics system.


On Wed, Nov 4, 2020 at 6:08 PM Nikola Hrusov <[hidden email]> wrote:
Hello Robert,

Thank you for your reply.

What you have described is more or less the issue I am experiencing.

When using the example you provided (with taskmanager.host set to "taskmanager01" rather than a custom name) and a scale of 3, all jobs are reported under the same hostname, even if they run on different servers. That way I cannot identify which server or which taskmanager ran a job. I can only go by the 32-character hash, which changes every time a taskmanager restarts, so it is not a good indicator.

The current setup I have is 3 servers, each of which gets a hostname like "taskmanager-node01" (the servers are node01, node02, node03). It runs in Docker Swarm with Flink 1.10.2:

  taskmanager:
    image: flinkimage
    hostname: "taskmanager-{{.Node.Hostname}}"
    command: taskmanager
    deploy:
      mode: global

This is how the taskmanager is deployed. `taskmanager.host` is not specified and not needed. As far as I can tell, the hostname used to be used only for the metrics.


My goal is to have the server name (hostname) in the metrics so that I can identify which jobs are running on which server. Previously, Flink's metrics used the configured hostname by default instead of the IP address.
Is that change in Flink's behaviour intentional?


Regards,
Nikola Hrusov



On Tue, Nov 3, 2020 at 11:26 PM Robert Metzger <[hidden email]> wrote:
Hey Nikola,
sorry for the delayed response.

I just tried the docker-compose files you've provided, and "docker-compose -f docker-compose.yml up" works for me -- metrics are shown in the UI, and I'm able to submit jobs via the web UI and the command line client.

I got to work the "docker-compose_with_hostname.yml" after the following change:


diff --git a/docker-compose_with_hostname.yml b/docker-compose_with_hostname.yml
index d876cb5..c34073f 100644
--- a/docker-compose_with_hostname.yml
+++ b/docker-compose_with_hostname.yml
@@ -26,7 +26,7 @@ services:
         FLINK_PROPERTIES=
         jobmanager.rpc.address: jobmanager
         taskmanager.numberOfTaskSlots: 2
-        taskmanager.host: "taskmanager-node01"
+        taskmanager.host: "taskmanager01"
         metrics.reporter.grph.factory.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
         metrics.reporter.grph.host: graphite
         metrics.reporter.grph.port: 2003

The name of the docker-compose service is the hostname.
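
To illustrate the point about service names: on a Compose network each service name is registered as a DNS name that the other containers can resolve, which is why "taskmanager01" works as taskmanager.host where "taskmanager-node01" did not. A minimal sketch, assuming the taskmanager service is called taskmanager01 and an official flink:1.11.1 image is used (the image tag and file format version are assumptions, not taken from the repository):

```yaml
version: "3.7"                     # assumed compose file format version
services:
  jobmanager:
    image: flink:1.11.1            # assumed image tag
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

  taskmanager01:                   # service name; other containers on the network can resolve it
    image: flink:1.11.1            # assumed image tag
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
        taskmanager.host: taskmanager01
```

Because taskmanager.host matches a name the jobmanager can resolve, registration and the metrics query service can reach the taskmanager.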

On Tue, Nov 3, 2020 at 5:24 PM Nikola Hrusov <[hidden email]> wrote:
Hello,

I am still trying to find out how to properly set up a cluster with Flink 1.11 and receive metrics under the hostnames.
In my previous email I outlined that I need to choose between a) receiving proper metrics and b) running my jobs. Ideally I should be able to do both, as this is possible with Flink 1.10.

Can somebody shed some light on this matter?

Regards,
Nikola Hrusov


On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov <[hidden email]> wrote:
Hi Xintong,

I have tried using the configuration taskmanager.host, but that actually makes it even worse. I have made a simple setup with docker compose to make the problem easier to reproduce and explain. You can find the compose files here: https://github.com/nikobearrr/flink-hostname-metrics

I have made 2 identical compose files which can start a flink jobmanager and taskmanager together with graphite (for metrics). One is called docker-compose.yml and the second one docker-compose_with_hostname.yml
The only difference between those two is line #29 which is the taskmanager.host variable. They both expose port 8081 for flink cluster UI and port 8082 for graphite UI.


Running the setup without the taskmanager.host
When you run the compose without the taskmanager.host variable the cluster starts just fine and the taskmanager registers. Running a job on that cluster would be just fine.
The issue is that if you check the metrics in the Graphite UI, it shows the IP (in this case 172.20.0.3) instead of the hostname. That was not the case with versions of Flink prior to 1.11.

image.png

Running the setup with the taskmanager.host
Once I run the compose which includes the taskmanager.host variable I can see the cluster UI and it starts up fine. Also the metrics come correctly:

image.png

However, now I found something wrong.

The first thing is that when I go to http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics (where 70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager), I start getting these logs in my console:

jobmanager_1     | 2020-10-27 15:59:44,366 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
jobmanager_1     | 2020-10-27 15:59:55,236 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
jobmanager_1     | 2020-10-27 16:00:07,219 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]

image.png

Also, the metrics for the taskmanager do not show up, as seen in the picture above. If you do not use "taskmanager.host", the metrics show up and there are no such WARN logs for UnknownHostException.
More importantly, we also see failures when we run batch jobs, apparently for the same reason: the jobs fail on submission. This only happens when we explicitly set the "taskmanager.host" variable.

from taskmanager:

2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 with leader id 00000000-0000-0000-0000-000000000000.
2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved JobManager address, beginning registration
2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Registration at JobManager was declined: Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname
2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and re-attempting registration in 30000 ms
2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:1, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId: a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.


from jobmanager:

2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager [hidden email]://flink@jobmanager:6123/user/rpc/jobmanager_3 for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager [hidden email]://flink@jobmanager:6123/user/rpc/jobmanager_3 for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{UNKNOWN} for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id a4dfa7dfa91bbce7cd091aa1512c0745.
2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40] ERROR org.apache.flink.runtime.jobmaster.JobMaster - Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname



Both the taskmanager and the jobmanager agree on one thing: "taskmanager-node01: No address associated with hostname".
Setting the hostname explicitly helps the metrics in graphite, but then the job submission/execution does not work, which is even worse than not having the metrics.

So my question is: Is there anything more which needs to be set when using the taskmanager.host config? Or perhaps I am doing something wrong with the setup? 

Regards,
Nikola Hrusov



Re: Hostname for taskmanagers when running in docker

Chesnay Schepler
There is no convenient cosmetic way to achieve what you want.
The only approach that would currently work is hard-coding the host into the configuration of each taskmanager via the metrics.scope.* configuration options.
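
As a rough sketch of what hard-coding the host via metrics.scope.* could look like for one taskmanager: the keys below are Flink's scope-format options, and the values are their default formats with the <host> variable replaced by a fixed name. The name taskmanager-node01 is only an example taken from this thread; each taskmanager would need its own value, and custom scope formats would need to be adapted accordingly.

```yaml
# flink-conf.yaml fragment for the taskmanager on node01 (example name, set per taskmanager)
# taskmanager.host is deliberately left unset so RPC keeps using a resolvable address.
metrics.scope.tm: taskmanager-node01.taskmanager.<tm_id>
metrics.scope.tm.job: taskmanager-node01.taskmanager.<tm_id>.<job_name>
metrics.scope.task: taskmanager-node01.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
metrics.scope.operator: taskmanager-node01.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
```

This only changes how metric identifiers are rendered; it does not affect how the jobmanager and taskmanagers address each other.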

On 11/4/2020 8:14 PM, Robert Metzger wrote:
I hope it's fine that I moved our discussion back on the list.

You cannot put an arbitrary hostname in the Flink configuration key "taskmanager.host". It must be a valid hostname that is resolvable within the Flink cluster, so that the RPC services can reach each other.
I don't think there's a way to define a custom taskmanager name in the Flink metrics.
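
For the single-host Compose reproduction, one way a custom name could be made resolvable is a network alias on the taskmanager service. This is only a sketch under assumptions (service name taskmanager01, the default Compose network, a flink:1.11.1 image); it was not tried in this thread, and it does not by itself cover the swarm setup, where the name depends on the node.

```yaml
version: "3.7"                     # assumed compose file format version
services:
  taskmanager01:
    image: flink:1.11.1            # assumed image tag
    command: taskmanager
    networks:
      default:
        aliases:
          - taskmanager-node01     # extra DNS name other containers on this network can resolve
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.host: taskmanager-node01
```

The alias has to match the value of taskmanager.host exactly, otherwise the jobmanager would presumably run into the same UnknownHostException as before.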

I'm adding Chesnay to the conversation, since he's very familiar with the metrics system.





Re: Hostname for taskmanagers when running in docker

Nikola Hrusov
Hello,

Thank you both for your input. I accidentally pressed Reply instead of Reply all; thanks for bringing the discussion back to the user list.

As it is, there are 2 ways to configure the hostname: 1) using docker's hostname property under a service, or 2) using Flink's explicit taskmanager.host configuration.

Prior to 1.11 the taskmanager.host variable was not needed. The cluster itself does not seem to have taken the hostname into consideration, because when I went to jobmanager -> taskmanagers I could see the list of taskmanagers based on internal IPs. So the default internal IPs were being used by the cluster, while docker's hostname attribute was used for the metrics. The metrics format was flink.host.<host>.job.xxxx, and <host> was replaced with the value I had put in the docker hostname attribute. My only issue is that I couldn't find anything in the documentation about such a change, and suddenly my metrics were no longer using docker's hostname.

I will try to use the metrics scope and pass the name of the hostname there instead.
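
A sketch of how that might look in the swarm stack file, assuming that swarm expands Go templates such as {{.Node.Hostname}} in environment values the same way it does for hostname, and that the image reads FLINK_PROPERTIES as in the reproduction compose files. The scope values are Flink's default formats with <host> replaced; only two scopes are shown, and the file format version is an assumption.

```yaml
version: "3.7"                     # assumed compose file format version
services:
  taskmanager:
    image: flinkimage
    hostname: "taskmanager-{{.Node.Hostname}}"
    command: taskmanager
    environment:
      # Assumption: the template is expanded per node when the stack is deployed.
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        metrics.scope.tm: taskmanager-{{.Node.Hostname}}.taskmanager.<tm_id>
        metrics.scope.tm.job: taskmanager-{{.Node.Hostname}}.taskmanager.<tm_id>.<job_name>
    deploy:
      mode: global
```

With mode: global there is one taskmanager per node, so each one would report under its own node name without taskmanager.host being set.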

Regards,
Nikola Hrusov


On Thu, Nov 5, 2020 at 1:31 AM Chesnay Schepler <[hidden email]> wrote:
There is no convenient cosmetic way to achieve what you want.
The only approach that would currently work is hard-coding the host into the configuration of each taskmanager via the metrics.scope.* configuration options.

On 11/4/2020 8:14 PM, Robert Metzger wrote:
I hope it's fine that I moved our discussion back on the list.

You can not put an arbitrary hostname for the Flink configuration key "taskmanager.host". It must be a valid, resolvable hostname within the Flink cluster so that the RPC services can reach each other.
I don't think there's a way to define a custom taskmanager name in the Flink metrics.

I'm adding Chesnay to the conversation, since he's very familiar with the metrics system.


On Wed, Nov 4, 2020 at 6:08 PM Nikola Hrusov <[hidden email]> wrote:
Hello Robert,

Thank you for your reply.

What you have described is more or less the issue I am experiencing.

When using the example you provided (with taskmanager.host "taskmanager01" and not custom one), if you set the scale to 3 you will get all jobs under the same hostname, even if they are run on different servers. In that way I cannot identify which server has run the job or which taskmanager. I can only base it on the 32-length hash which changes every time a taskmanager restarts, so it is not a good indicator. 

The current setup I have is 3 servers and each one of them gets a hostname like "taskmanager-node01" (where servers are node01, node02, node03). That is running in docker swarm, with flink 1.10.2

  taskmanager:
    image: flinkimage
    hostname: "taskmanager-{{.Node.Hostname}}"
    command: taskmanager
    deploy:
      mode: global

This is how the taskmanager is being deployed. The `taskmanager.host` is not specified and not needed. The hostname used to be used only for the metrics as far as I can tell.


My goal is to be able to put the server name (hostname) in metrics so I can identify which jobs are running on them. By default the metrics on flink would use the hostname provided instead of the IP address. 
Is that change in flink's behaviour intentional/on purpose?


Regards
,
Nikola Hrusov



On Tue, Nov 3, 2020 at 11:26 PM Robert Metzger <[hidden email]> wrote:
Hey Nikola,
sorry for the delayed response.

I just tried the docker-compose files you've provided, and "docker-compose -f docker-compose.yml up" works for me -- metrics are shown in the UI, and I'm able to submit jobs via the web UI and the command line client.

I got to work the "docker-compose_with_hostname.yml" after the following change:


diff --git a/docker-compose_with_hostname.yml b/docker-compose_with_hostname.yml
index d876cb5..c34073f 100644
--- a/docker-compose_with_hostname.yml
+++ b/docker-compose_with_hostname.yml
@@ -26,7 +26,7 @@ services:
         FLINK_PROPERTIES=
         jobmanager.rpc.address: jobmanager
         taskmanager.numberOfTaskSlots: 2
-        taskmanager.host: "taskmanager-node01"
+        taskmanager.host: "taskmanager01"
         metrics.reporter.grph.factory.class: org.apache.flink.metrics.graphite.GraphiteReporterFactory
         metrics.reporter.grph.host: graphite
         metrics.reporter.grph.port: 2003

The name of the docker-compose service is the hostname.

On Tue, Nov 3, 2020 at 5:24 PM Nikola Hrusov <[hidden email]> wrote:
Hello,

I am still trying to find how to properly setup a cluster with flink 1.11 and receive metrics on the hostnames.
In my previous email I outlined I need to choose: a) receiving proper metrics or b) running my jobs. Ideally I should be able to do both as this is possible with flink 1.10

Can somebody shed some light on this matter?

Regards
,
Nikola Hrusov


On Tue, Oct 27, 2020 at 9:35 PM Nikola Hrusov <[hidden email]> wrote:
Hi Xintong,

I have tried using the configuration taskmanager.host, but that actually makes it even worse. I have made a simple setup with docker compose to reproduce/explain it easier. You can find the compose files here: https://github.com/nikobearrr/flink-hostname-metrics

I have made 2 identical compose files which can start a flink jobmanager and taskmanager together with graphite (for metrics). One is called docker-compose.yml and the second one docker-compose_with_hostname.yml
The only difference between those two is line #29 which is the taskmanager.host variable. They both expose port 8081 for flink cluster UI and port 8082 for graphite UI.


Running the setup without the taskmanager.host
When you run the compose without the taskmanager.host variable the cluster starts just fine and the taskmanager registers. Running a job on that cluster would be just fine.
The issue is that if you check the metrics in the Graphite UI instead of the hostname it will show the IP (in this case 172.20.0.3). That was not the case with prior 1.11 version of flink.

image.png






Running the setup with the taskmanager.host
Once I run the compose which includes the taskmanager.host variable I can see the cluster UI and it starts up fine. Also the metrics come correctly:

image.png

However, now I found something wrong.

The first thing is that when you go to http://localhost:8081/#/task-manager/70704dc334ac8007925409c575e42d7d/metrics where  70704dc334ac8007925409c575e42d7d is the GUID of the taskmanager I start getting those logs in my console:

jobmanager_1     | 2020-10-27 15:59:44,366 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
jobmanager_1     | 2020-10-27 15:59:55,236 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]
jobmanager_1     | 2020-10-27 16:00:07,219 WARN  akka.remote.ReliableDeliverySupervisor                       [] - Association with remote system [akka.tcp://flink-metrics@taskmanager-node01:42269] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@taskmanager-node01:42269]] Caused by: [java.net.UnknownHostException: taskmanager-node01: Name or service not known]

image.png

Also the metrics for the taskmanager do not show as shown on the picture above. If you do not use "taskmanager.host" then metrics show and there are no such WARN logs for UnknownHostException.
More importantly, we also see issues with this when we run batch jobs for what seems the same issue. The jobs fail on submission. This only happens when we explicitly set "taskmanager.host" variable

from taskmanager:

2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Add job 6b6f5ee3a1ab7556fd0db64de0f7cb1d for job leader monitoring.
2020-10-27T17:11:55.646Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Try to register at job manager akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_3 with leader id 00000000-0000-0000-0000-000000000000.
2020-10-27T17:11:55.652Z [flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Resolved JobManager address, beginning registration
2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Registration at JobManager was declined: Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname
2020-10-27T17:11:55.657Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Pausing and re-attempting registration in 30000 ms
2020-10-27T17:12:25.646Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:1, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=62.400mb (65431141 bytes), taskOffHeapMemory=0 bytes, managedMemory=62.720mb (65766687 bytes), networkMemory=15.680mb (16441671 bytes)}, allocationId: a4dfa7dfa91bbce7cd091aa1512c0745, jobId: 6b6f5ee3a1ab7556fd0db64de0f7cb1d).
2020-10-27T17:12:25.647Z [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.taskexecutor.DefaultJobLeaderService - Remove job 6b6f5ee3a1ab7556fd0db64de0f7cb1d from job leader monitoring.


from jobmanager:

2020-10-27T17:11:55.670Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2020-10-27T17:11:55.671Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager [hidden email] for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-35] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager [hidden email] for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-36] INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl - Requesting new slot [SlotRequestId{b9b6db1c89bc8e69809fa0cf66ef0ef7}] and profile ResourceProfile{UNKNOWN} from resource manager.
2020-10-27T17:11:55.672Z [flink-akka.actor.default-dispatcher-40] INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{UNKNOWN} for job 6b6f5ee3a1ab7556fd0db64de0f7cb1d with allocation id a4dfa7dfa91bbce7cd091aa1512c0745.
2020-10-27T17:11:55.686Z [flink-akka.actor.default-dispatcher-40] ERROR org.apache.flink.runtime.jobmaster.JobMaster - Could not accept TaskManager registration. TaskManager address taskmanager-node01 cannot be resolved. taskmanager-node01: No address associated with hostname



Both the taskmanager and the jobmanager agree on one thing: "taskmanager-node01: No address associated with hostname".
Setting the hostname explicitly fixes the metrics in graphite, but then job submission and execution no longer work, which is even worse than not having the metrics.

So my question is: is there anything else that needs to be set when using the taskmanager.host config? Or am I perhaps doing something wrong in the setup?
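
The error above is a plain name-lookup failure, so it can be checked independently of Flink. Below is a minimal sketch, assuming a Python interpreter is available in some container attached to the same docker network as the jobmanager (the Flink image itself may not ship one). The helper name `resolves` is hypothetical and not part of Flink; it only asks the local resolver whether a name maps to any address.

```python
import socket
import sys


def resolves(hostname: str) -> bool:
    """Return True if the local resolver maps hostname to at least one address.

    Hypothetical helper for illustration; it does not reproduce Flink's own
    registration logic, only the underlying name lookup.
    """
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved (no DNS or hosts entry).
        return False


if __name__ == "__main__":
    # Usage: python check_resolve.py taskmanager-node01 jobmanager
    for name in sys.argv[1:]:
        status = "resolves" if resolves(name) else "does NOT resolve"
        print(f"{name}: {status}")
```

Run from the jobmanager's side of the network with `taskmanager-node01` as the argument: if it reports that the name does not resolve, the value given to taskmanager.host is not known to docker's DNS on that network, which would match the log lines above.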

Regards,
Nikola Hrusov

