Re: Jira issue Flink-11127

Posted by Andrey Zagrebin-3 on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Jira-issue-Flink-11127-tp26180p26298.html

cc [hidden email]

On Fri, Feb 22, 2019 at 1:28 AM Boris Lublinsky <[hidden email]> wrote:
Adding metric-query port makes it a bit better, but there is still an error


019-02-22 00:03:56,173 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-02-22 00:04:16,213 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-02-22 00:04:36,253 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-02-22 00:04:56,293 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify”..

In the task manager and

2019-02-21 23:59:46,479 ERROR akka.remote.EndpointWriter                                    - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/]] arriving at [akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123] inbound addresses are [akka.tcp://flink@127.0.0.1:6123]
2019-02-21 23:59:57,808 ERROR akka.remote.EndpointWriter                                    - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/]] arriving at [akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123] inbound addresses are [akka.tcp://flink@127.0.0.1:6123]
2019-02-22 00:00:06,519 ERROR akka.remote.EndpointWriter                                    - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/]] arriving at [akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123] inbound addresses are [akka.tcp://flink@127.0.0.1:6123]
2019-02-22 00:00:17,849 ERROR akka.remote.EndpointWriter                                    - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/]] arriving at [akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123] inbound addresses are [akka.tcp://flink@127.0.0.1:6123]
2019-02-22 00:00:26,558 ERROR akka.remote.EndpointWriter                                    - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/]] arriving at [akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123] inbound addresses are [akka.tcp://flink@127.0.0.1:6123]
2019-02-22 00:00:37,888 ERROR akka.remote.EndpointWriter                                    - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123/]] arriving at [akka.tcp://flink@maudlin-ibis-fdp-flink-jobmanager:6123] inbound addresses are [akka.tcp://flink@127.0.0.1:6123]

I the job manager

Port 6123 is opened in both Job Manager deployment

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: {{ template "fullname" . }}-jobmanager
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '9249'
      labels:
        server: flink
        app: {{ template "fullname" . }}
        component: jobmanager
    spec:
      containers:
      - name: jobmanager
        image: {{ .Values.image }}:{{ .Values.imageTag }}
        imagePullPolicy: {{ .Values.imagePullPolicy }}
        args:
        - jobmanager
        ports:
        - containerPort: 6123
          name: rpc
        - containerPort: 6124
          name: blob
        - containerPort: 8081
          name: ui
        env:
        - name: CONTAINER_METRIC_PORT
          value: '{{ .Values.flink.metric_query_port }}'
        - name: JOB_MANAGER_RPC_ADDRESS
          value : {{ template "fullname" . }}-jobmanager
        livenessProbe:
          httpGet:
            path: /overview
            port: 8081
          initialDelaySeconds: 30
          periodSeconds: 10
        resources:
          limits:
            cpu: {{ .Values.resources.jobmanager.limits.cpu }}
            memory: {{ .Values.resources.jobmanager.limits.memory }}
          requests:
            cpu: {{ .Values.resources.jobmanager.requests.cpu }}
            memory: {{ .Values.resources.jobmanager.requests.memory }}

And Job manager service

apiVersion: v1
kind: Service
metadata:
  name: {{ template "fullname" . }}-jobmanager
spec:
  ports:
  - name: rpc
    port: 6123
  - name: blob
    port: 6124
  - name: ui
    port: 8081
  selector:
    app: {{ template "fullname" . }}
    component: jobmanager




Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Feb 21, 2019, at 6:13 PM, Boris Lublinsky <[hidden email]> wrote:


Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Feb 21, 2019, at 2:05 AM, Konstantin Knauf <[hidden email]> wrote:

Hi Boris,

the exact command depends on the docker-entrypoint.sh script and the image you are using. For the example contained in the Flink repository it is "task-manager", I think. The important thing is to pass "taskmanager.host" to the Taskmanager process. You can verify by checking the Taskmanager logs. These should contain lines like below:

2019-02-21 08:03:00,004 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] -  Program Arguments:
2019-02-21 08:03:00,008 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] -     -Dtaskmanager.host=10.12.10.173

In the Jobmanager logs you should see that the Taskmanager is registered under the IP above in a line similar to:

2019-02-21 08:03:26,874 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering TaskManager with ResourceID a0513ba2c472d2d1efc07626da9c1bda (akka.tcp://flink@10.12.10.173:46531/user/taskmanager_0) at ResourceManager

A service per Taskmanager is not required. The purpose of the config parameter is that the Jobmanager addresses the taskmanagers by IP instead of hostname.

Hope this helps!

Cheers,

Konstantin



On Wed, Feb 20, 2019 at 4:37 PM Boris Lublinsky <[hidden email]> wrote:
Also, The suggested workaround does not quite work.
2019-02-20 15:27:43,928 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@flink-taskmanager-1:6170] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink-metrics@flink-taskmanager-1:6170]] Caused by: [flink-taskmanager-1: No address associated with hostname]
2019-02-20 15:27:48,750 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler - Caught exception

I think the problem is that its trying to connect to flink-task-manager-1

Using busybody to experiment with nslookup, I can see
/ # nslookup flink-taskmanager-1.flink-taskmanager
Server:    10.0.11.151
Address 1: 10.0.11.151 ip-10-0-11-151.us-west-2.compute.internal

Name:      flink-taskmanager-1.flink-taskmanager
Address 1: 10.131.2.136 flink-taskmanager-1.flink-taskmanager.flink.svc.cluster.local
/ # nslookup flink-taskmanager-1
Server:    10.0.11.151
Address 1: 10.0.11.151 ip-10-0-11-151.us-west-2.compute.internal

nslookup: can't resolve 'flink-taskmanager-1'
/ # nslookup flink-taskmanager-0.flink-taskmanager
Server:    10.0.11.151
Address 1: 10.0.11.151 ip-10-0-11-151.us-west-2.compute.internal

Name:      flink-taskmanager-0.flink-taskmanager
Address 1: 10.131.0.111 flink-taskmanager-0.flink-taskmanager.flink.svc.cluster.local
/ # nslookup flink-taskmanager-0
Server:    10.0.11.151
Address 1: 10.0.11.151 ip-10-0-11-151.us-west-2.compute.internal

nslookup: can't resolve 'flink-taskmanager-0'
/ # 

So the name should be postfixed with the service name. How do I force it? I suspect I am missing config parameter

 
Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/

On Feb 19, 2019, at 4:33 AM, Konstantin Knauf <[hidden email]> wrote:

Hi Boris,

the solution is actually simpler than it sounds from the ticket. The only thing you need to do is to set the "taskmanager.host" to the Pod's IP address in the Flink configuration. The easiest way to do this is to pass this config dynamically via a command-line parameter. 

The Deployment spec could looks something like this:
containers:
- name: taskmanager
[...]
args:
- "taskmanager.sh"
- "start-foreground"
- "-Dtaskmanager.host=$(K8S_POD_IP)"
[...]
  env:
- name: K8S_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP

Hope this helps and let me know if this works. 

Best, 

Konstantin

On Sun, Feb 17, 2019 at 9:51 PM Boris Lublinsky <[hidden email]> wrote:
Apparently there is a workaround for it.
Is it possible provide the complete helm chart for it.
Bits and pieces are in the ticket, but it would be nice to see the full chart

Boris Lublinsky
FDP Architect
[hidden email]
https://www.lightbend.com/



--
Konstantin Knauf | Solutions Architect
+49 160 91394525


Follow us @VervericaData
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen   



--
Konstantin Knauf | Solutions Architect
+49 160 91394525

Follow us @VervericaData
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time
--
Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen