Flink -mesos-app master hang

bdas77
Hi All,
I'm trying to run a Flink Docker image from Marathon with the Mesos app master. It goes into a continuous loop and fails to launch the task manager. In the Mesos master UI I can see the job manager web UI, but it shows zero task managers.

I have checked pretty much every possible log, starting from docker.log on the Ubuntu machine and the Mesos master/slave logs, and there is no information other than the failed task. The only relevant output is the Flink log below. However, I'm able to run the same Docker image if I run the jobmanager and taskmanager by themselves in Marathon and let them connect via the jobmanager RPC port.

For the Mesos config, I'm using the following settings from the yml:
mesos.master: ${MESOS_MASTER}
mesos.failover-timeout: 60
mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}
mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}
mesos.resourcemanager.tasks.cpus: ${RESOURCEMANAGER_TASKS_CPU:-1}
mesos.resourcemanager.tasks.container.type: docker
mesos.resourcemanager.tasks.container.image.name: ${IMAGE_NAME}
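
The `${VAR}` and `${VAR:-default}` placeholders above use shell parameter-expansion syntax, so something (presumably the image's entrypoint, which is an assumption here) has to expand them before Flink reads the file. A minimal Python sketch of that expansion, useful for checking what the final values would be; `expand` is a hypothetical helper, not part of Flink or Mesos:

```python
import re

# Matches ${NAME} and ${NAME:-default}, the two forms used in the config above.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")

def expand(line, env):
    """Expand shell-style placeholders in one config line (hypothetical helper)."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        # ':-' falls back to the default when the variable is unset or empty.
        if value:
            return value
        return default if default is not None else ""
    return _PLACEHOLDER.sub(substitute, line)

env = {"MESOS_MASTER": "zk://XX/mesos"}
print(expand("mesos.master: ${MESOS_MASTER}", env))
print(expand("mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}", env))
# INITIAL_TASK_MANAGERS is not set in this env, so the value comes out empty.
print(expand("mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}", env))
```

A placeholder without a default that is missing from the environment expands to an empty value, which is worth ruling out for `mesos.initial-tasks` and the image name.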

---------------------------
07-30 02:05:48,351 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00002 failed unexpectedly.
2017-07-30 02:05:48,352 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00002 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED (Container exited with status 127)
-----------------------------------------------------
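
Exit status 127 is the status a POSIX shell returns when the command it was asked to run cannot be found, so it may be worth checking that the task manager start command actually exists inside the image. A minimal Python sketch that reproduces the status locally (the command name is made up, and a POSIX `sh` on the PATH is assumed):

```python
import subprocess

# Ask the shell to run a command that does not exist (the name is made up).
result = subprocess.run(
    ["sh", "-c", "definitely-not-a-real-command-xyz"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
status = result.returncode
print(status)
```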

Please let me know if anyone has any pointers for debugging this further.


~ Biswajit

Thank you
~/Das

Fwd: Flink -mesos-app master hang

bdas77
Hi There,

I posted this to the group a few days back and have since been exchanging emails with Eron; thanks to Eron for all the tips. Now I see this basic auth error, and I'm a little confused about how the Job Manager launched fine while the task manager fails to authenticate.
Also, the Mesos docs say authentication is false by default, so it should not have gone down that path. Do I have to disable it somewhere inside Flink? I don't see any config or property for it in the code.

This is kind of a blocker for my Mesos deployment right now, so I'd really appreciate any input or suggestions.

~ Biswajit

---------- Forwarded message ----------
From: Eron Wright <[hidden email]>
Date: Wed, Aug 2, 2017 at 10:51 AM

From: Biswajit Das <[hidden email]>
Sent: Wednesday, August 2, 2017 10:19:45 AM
To: Eron Wright
Subject: Re: Flink -mesos-app master hang
 
Hi Eron,

Good morning, and I'm really sorry for flooding you with questions. I'll post this one to the user group as well.
I could narrow down the actual error thrown by Mesos: it seems the JM is somehow not able to authenticate. I'm a little confused about whether it is a Docker private registry TLS error or something else. I have even started the slave with --docker_config; previously I was mostly using docker.tar.gz with the container for private repo authentication.

017-08-02 03:32:54,163 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00003 failed unexpectedly.
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00003 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"')
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00003 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"'
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
2017-08-02 03:32:54,164 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
2017-08-02 03:32:54,171 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
root@ip-172-31-4-44:/etc/me
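
The message in the log suggests the registry answered with a Basic challenge where the image fetcher expected a different format, most likely a Bearer (token) challenge; that reading is an assumption and is not confirmed anywhere in this thread. A small Python sketch of how such a header splits into a scheme and its parameters (`parse_challenge` is a hypothetical helper, not Mesos code, and the Bearer URL is a made-up placeholder):

```python
import re

def parse_challenge(header):
    """Split a WWW-Authenticate value into (scheme, params). Hypothetical helper."""
    scheme, _, rest = header.strip().partition(" ")
    # Parameters are key="value" pairs separated by commas.
    params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
    return scheme, params

# The header exactly as it appears in the log above.
scheme, params = parse_challenge('Basic realm="Registry Realm"')
print(scheme, params)

# A token-based registry would instead answer with a Bearer challenge.
bearer_scheme, bearer_params = parse_challenge(
    'Bearer realm="https://auth.example.invalid/token",service="registry"'
)
print(bearer_scheme, sorted(bearer_params))
```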

On Tue, Aug 1, 2017 at 1:53 PM, Eron Wright <[hidden email]> wrote:

I think you're on the right track in trying to configure the docker image provider. This is on Linux, right, and you definitely restarted the agents?


An important difference between the JM and the TM is that the JM is a task launched by the Marathon framework, whereas the TM is a task launched by the JM framework. The respective configurations and behaviors are different. For example, I see that Marathon is launching the JM with the Docker containerizer, whereas the JM is launching the TM with the Mesos containerizer (with Docker image provider support). The Mesos containerizer is more modern and preferred, and I don't think Flink supports anything else.
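
To make the JM/TM difference concrete, here is a hedged Python sketch of the two container descriptions. The first mirrors the Marathon app definition in this thread; the second is an assumed shape based on the Mesos `ContainerInfo` message (a `MESOS`-type container carrying a Docker image), with `IMAGE_NAME` standing in for whatever the Flink config resolves to. `image_is_pulled_by` is a hypothetical helper for illustration only:

```python
# Container description Marathon uses for the JM, per the app definition in this thread.
jm_container = {
    "type": "DOCKER",
    "docker": {"image": "docker.xx.xx/flink:1.8.0", "network": "HOST"},
}

# Assumed shape of what the JM framework requests for a TM:
# a MESOS-type container whose image happens to be a Docker image.
tm_container = {
    "type": "MESOS",
    "mesos": {"image": {"type": "DOCKER", "docker": {"name": "IMAGE_NAME"}}},
}

def image_is_pulled_by(container):
    """Name the component that fetches the image for a container description."""
    if container["type"] == "DOCKER":
        return "Docker daemon"
    if container["type"] == "MESOS":
        return "Mesos image provisioner"
    raise ValueError("unknown container type: %r" % container["type"])

print(image_is_pulled_by(jm_container))
print(image_is_pulled_by(tm_container))
```

Because two different components pull the image, registry credentials that work for one path (for example a fetched docker.tar.gz) do not necessarily apply to the other.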


The doc I linked to shows how to launch a docker image-based container with mesos-execute.   Using mesos-execute to verify your cluster configuration is a good idea, to isolate any issue.  For example, see if you can launch a container using the Mesos containerizer and the Docker image provider, executing a simple command such as 'sleep'.
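
A hedged sketch of what that `mesos-execute` check could look like, written as a small Python helper that only builds and prints the command line. The flag names (`--master`, `--name`, `--docker_image`, `--command`) are given from memory of the Mesos container-image documentation and should be confirmed with `mesos-execute --help`; the master address and task name are placeholders, and `mesos_execute_argv` is a hypothetical helper:

```python
import shlex

def mesos_execute_argv(master, image, command="sleep 60", name="image-pull-test"):
    """Build an argv list for a one-off test task (flag names are assumptions)."""
    return [
        "mesos-execute",
        f"--master={master}",
        f"--name={name}",
        f"--docker_image={image}",
        f"--command={command}",
    ]

argv = mesos_execute_argv("XXX:5050", "docker.xx.xx/flink:1.8.0")
# Quote each argument so the printed line can be pasted into a shell.
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)
```

If this simple task fails with the same registry error, the problem is in the agent or registry setup rather than in Flink.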


Eron


From: Biswajit Das <[hidden email]>
Sent: Tuesday, August 1, 2017 10:02:51 AM
To: Eron Wright

Subject: Re: Flink -mesos-app master hang
 
Hi Eron ,

Thank you for the email; I really appreciate your reply.

That's what is confusing me. I have been running Mesos with containers on both staging and production for almost a year now, mostly Spark/Presto load, everything containerized, on a fairly big cluster. Here is one of my slave configs. One interesting part is that the app master is launched and I can access the job manager web UI from the Mesos framework; I can also see that it has registered itself as the `flink` framework. The only problem I'm seeing is that the task manager count shows `0`, although I asked it to create 2 instances.


/usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos --attributes=environment:dev;agent_role:generic --containerizers=docker,mesos --executor_registration_timeout=10mins --hostname=XXX --image_providers=appc,docker --ip=XXX --isolation=filesystem/linux,docker/runtime --resources=ports(*):[0-65535] --work_dir=/var/lib/mesos
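
As a quick sanity check on those agent flags, the Mesos documentation on container images asks for the `docker` image provider plus the `filesystem/linux` and `docker/runtime` isolators when Docker images run under the Mesos containerizer. A minimal Python sketch that parses the command line above and reports anything missing; both helper functions are hypothetical, and the quotes around the `--attributes` and `--resources` values are added here only so the string is shell-safe:

```python
import shlex

AGENT_CMD = (
    "/usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos "
    "'--attributes=environment:dev;agent_role:generic' "
    "--containerizers=docker,mesos --executor_registration_timeout=10mins "
    "--hostname=XXX --image_providers=appc,docker --ip=XXX "
    "--isolation=filesystem/linux,docker/runtime "
    "'--resources=ports(*):[0-65535]' --work_dir=/var/lib/mesos"
)

def parse_flags(command):
    """Turn '--key=value' tokens into a dict, skipping the binary name."""
    flags = {}
    for token in shlex.split(command)[1:]:
        key, _, value = token.lstrip("-").partition("=")
        flags[key] = value
    return flags

def missing_for_docker_images(flags):
    """List flag values needed for Docker images in the Mesos containerizer."""
    required = {
        "containerizers": {"mesos"},
        "image_providers": {"docker"},
        "isolation": {"filesystem/linux", "docker/runtime"},
    }
    problems = []
    for key, needed in required.items():
        present = set(filter(None, flags.get(key, "").split(",")))
        for item in sorted(needed - present):
            problems.append(f"--{key} lacks {item}")
    return problems

flags = parse_flags(AGENT_CMD)
problems = missing_for_docker_images(flags)
print(problems or "flags look sufficient for Docker images in the Mesos containerizer")
```

The flags pass this check, which is consistent with the error later in the thread moving from "Unsupported container image type" to a registry authentication problem.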


Previously I never had --image_providers and --isolation; after seeing this error I added these two, but it did not help much. I'm running on Ubuntu with Mesos 1.1.0 and submitting the job with Marathon.


I have tried toggling the Mesos debug log, but there is not much info other than a signal to kill the framework.

Marathon JSON task (the hostname constraint restricts the app to one host for debugging, as I have a fairly big cluster):
{
  "id": "/flink-app-master",
  "cmd": null,
  "cpus": 2,
  "mem": 4096,
  "disk": 10000,
  "instances": 1,
  "constraints": [
    [
      "hostname",
      "LIKE",
      "xxx"
    ]
  ],
  "acceptedResourceRoles": [
    "*"
  ],
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "docker.xx.xx/flink:1.8.0",
      "network": "HOST",
      "portMappings": [],
      "privileged": false,
      "parameters": [],
      "forcePullImage": false
    }
  },
  "env": {
    "MESOS_MASTER": "zk://XX/mesos"
  },
  "portDefinitions": [
    {
      "port": 9081,
      "protocol": "tcp",
      "name": "default",
      "labels": {}
    }
  ],
  "uris": [
    "file:///etc/docker.tar.gz"
  ],
  "fetch": [
    {
      "uri": "file:///etc/docker.tar.gz",
      "extract": true,
      "executable": false,
      "cache": false
    }
  ]
}

On Tue, Aug 1, 2017 at 7:22 AM, Eron Wright <[hidden email]> wrote:
From the error message it seems that your Mesos cluster doesn't have the docker image provisioner installed.   The message originates from Mesos anyway so the problem lies there.   Note that docker image support is provided in Linux only.  You can also use the Flink on Mesos support without images, if you make sure that JAVA_HOME is set on all executors.

Hope this helps!

From: Biswajit Das
Sent: Tuesday, August 1, 1:24 AM
Subject: Re: Flink -mesos-app master hang
Hi Eron, I came across some of your comments in JIRA and wanted to clarify this ^^. I'm running a little clueless here. Any pointers on where to look?


-----------------------------------------------
2017-08-01 07:26:34,688 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator            - Waiting for more offers; 1 task(s) are not yet launched.
2017-08-01 07:26:34,717 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Launching Mesos task taskmanager-00039 on host 172.31.5.212.
2017-08-01 07:26:34,731 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00039 failed unexpectedly.
2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00039 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unsupported container image type: DOCKER)
2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00039 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unsupported container image type: DOCKER
2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
2017-08-01 07:26:34,734 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
2017-08-01 07:26:34,745 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
2017-08-01 07:26:34,745 INFO  org.apache.f
---------------------------------------------------
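
The shutdown in this log follows the rule stated in the message itself: the session stops once the number of failed tasks exceeds `mesos.maximum-failed-tasks`, which by default equals the number of requested tasks. A minimal Python sketch of that rule as the log text describes it; this is not Flink's actual code, and `should_stop_session` is a hypothetical name:

```python
def should_stop_session(failed_tasks, requested_tasks, maximum_failed_tasks=None):
    """Return True when the failure count exceeds the configured limit."""
    # Per the log text, the limit defaults to the number of requested tasks.
    limit = requested_tasks if maximum_failed_tasks is None else maximum_failed_tasks
    return failed_tasks > limit

# Two task managers were requested, so the third failure stops the session.
print(should_stop_session(failed_tasks=2, requested_tasks=2))
print(should_stop_session(failed_tasks=3, requested_tasks=2))
```

Raising `mesos.maximum-failed-tasks` only buys more retries; it does not address why the container launch fails.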

Thank you in advance .
~Biswajit

On Sun, Jul 30, 2017 at 12:42 PM, Biswajit Das <[hidden email]> wrote:
Hi All,
I'm trying to run a Flink Docker image from Marathon with the Mesos app master. It goes into a continuous loop and fails to launch the task manager. In the Mesos master UI I can see the job manager web UI, but it shows zero task managers.

I have checked pretty much every possible log, starting from docker.log on the Ubuntu machine and the Mesos master/slave logs, and there is no information other than the failed task. The only relevant output is the Flink log below. However, I'm able to run the same Docker image if I run the jobmanager and taskmanager by themselves in Marathon and let them connect via the jobmanager RPC port.

For the Mesos config, I'm using the following settings from the yml:
mesos.master: ${MESOS_MASTER}
mesos.failover-timeout: 60
mesos.initial-tasks: ${INITIAL_TASK_MANAGERS}
mesos.resourcemanager.tasks.mem: ${RESOURCEMANAGER_TASKS_MEM:-4096}
mesos.resourcemanager.tasks.cpus: ${RESOURCEMANAGER_TASKS_CPU:-1}
mesos.resourcemanager.tasks.container.type: docker
---------------------------
07-30 02:05:48,351 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00002 failed unexpectedly.
2017-07-30 02:05:48,352 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00002 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_COMMAND_EXECUTOR_FAILED (Container exited with status 127)
-----------------------------------------------------

Please let me know if anyone has any pointers for debugging this further.


~ Biswajit







Thank you
~/Das

Re: Flink -mesos-app master hang

Till Rohrmann
Hi Biswajit,

Are there any Mesos logs which might help us pinpoint the problem? I've actually never run Flink on Mesos with Docker images, but it could be that Flink does not set things up properly for running Docker images. I'll try to run Flink based on Docker images over the weekend to see whether I can reproduce the problem.

Cheers,
Till

On Wed, Aug 2, 2017 at 8:48 PM, Biswajit Das <[hidden email]> wrote:
Hi There,

I posted this to the group a few days back and have since been exchanging emails with Eron; thanks to Eron for all the tips. Now I see this basic auth error, and I'm a little confused about how the Job Manager launched fine while the task manager fails to authenticate.
Also, the Mesos docs say authentication is false by default, so it should not have gone down that path. Do I have to disable it somewhere inside Flink? I don't see any config or property for it in the code.

This is kind of a blocker for my Mesos deployment right now, so I'd really appreciate any input or suggestions.

~ Biswajit


Re: Flink -mesos-app master hang

bdas77
Hi Till,

Thank you for the reply. I posted some logs in the initial email chain. I think the issue has more to do with the Docker private registry when authorization is involved. I can run the Docker image with the Job manager and task manager as separate Marathon tasks and connect them via the RPC port. I was trying to run via the Mesos app master so that the job manager itself launches the task managers as part of the framework.

Thank you again

~ Biswajit

On Fri, Aug 4, 2017 at 3:17 AM, Till Rohrmann <[hidden email]> wrote:
Hi Biswajit,

Are there any Mesos logs which might help us pinpoint the problem? I've actually never run Flink on Mesos with Docker images, but it could be that Flink does not set things up properly for running Docker images. I'll try to run Flink based on Docker images over the weekend to see whether I can reproduce the problem.

Cheers,
Till

On Wed, Aug 2, 2017 at 8:48 PM, Biswajit Das <[hidden email]> wrote:
Hi There,

I have posted this here in the group a few days back and after that I have been exchanging email with Eron, thanks to Eron for all the tips.  Now  I see this basic auth error, I'm little confused how come Job Manager launched fine and task manager failing to auth.
Also, mesos doc says by default authenticate is false so it should not have gone there,  do I have to disable somewhere inside flink ??? I don't see any config or property in code.

This is kind of blocker for me now for mesos deployment , really appreciate for any inputs/suggestion

~ Biswajit

---------- Forwarded message ----------
From: Eron Wright <[hidden email]>
Date: Wed, Aug 2, 2017 at 10:51 AM

From: Biswajit Das <[hidden email]>
Sent: Wednesday, August 2, 2017 10:19:45 AM
To: Eron Wright
Subject: Re: Flink -mesos-app master hang
 
Hi Eron ,

Good morning , I'm really sorry for flooding question . I'll post this one to user group also .
I could narrow down the actual error thrown by mesos , seems like JM some how not able to authenticate . I'm little confused if it is docker private registry tls error or some thing else , I have started slave even with --docker_config , previously mostly I was using  docker.tar.gz with container for private repo authentication .

017-08-02 03:32:54,163 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00003 failed unexpectedly.
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00003 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"')
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00003 in state TAS
K_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unexpected WWW-Authenticate header format: 'Basic realm="Registry Realm"'
2017-08-02 03:32:54,163 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
2017-08-02 03:32:54,164 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-02 03:32:54,164 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
2017-08-02 03:32:54,171 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
root@ip-172-31-4-44:/etc/me

On Tue, Aug 1, 2017 at 1:53 PM, Eron Wright <[hidden email]> wrote:

I think you're on the right track, in trying to configure the docker image provider.  This is on Linux right, and you definitely restarted the agents?


An important difference between the JM and the TM is that the JM is a task launched by the Marathon framework, whereas the TM is a task launched by the JM framework.  The respective configurations and behaviors are different.   For example, I see that Marathon is launching the JM with the Docker containerizer, whereas the JS is launching the TM with the Mesos containerizer (with Docker image provider support).     The Mesos containerizer is more modern and preferred, and I don't think Flink supports anything else.


The doc I linked to shows how to launch a docker image-based container with mesos-execute.   Using mesos-execute to verify your cluster configuration is a good idea, to isolate any issue.  For example, see if you can launch a container using the Mesos containerizer and the Docker image provider, executing a simple command such as 'sleep'.


Eron


From: Biswajit Das <[hidden email]>
Sent: Tuesday, August 1, 2017 10:02:51 AM
To: Eron Wright

Subject: Re: Flink -mesos-app master hang
 
Hi Eron ,

Thank you for the email , I really appreciate your reply.

That's what is confusing me. I have been running mesos with container both on staging and production for almost a year now with mostly spark/presto load everything containerize fairly big cluster. .. Here is one of my slave config . One interesting part here is ,  app master is launched and I can access job manager web UI from mesos frame work , I can also see it is registered itself as `flink` framework . The only thing I'm seeing task manager is showing `0` . I have asked to create 2 instance


/usr/sbin/mesos-slave --master=zk://XXX/mesos --log_dir=/var/log/mesos --attributes=environment:dev;agent_role:generic --containerizers=docker,mesos --executor_registration_timeout=10mins --hostname=XXX --image_providers=appc,docker --ip=XXX --isolation=filesystem/linux,docker/runtime --resources=ports(*):[0-65535] --work_dir=/var/lib/mesos


Previously I never had --image_providers and --isolation , after seeing this error I have added this two but not much help , I'm running on ubuntu /mesos 1.1.0 and submitting the job with marathon ..


I have tried with toggling mesos debug log , not much info ...other hen git signal to kill the framework ..

marathon json task
{
  "id": "/flink-app-master",
  "cmd": null,
  "cpus": 2,
  "mem": 4096,
  "disk": 10000,
  "instances": 1,
  "constraints": [
    [
      "hostname",
      "LIKE",
      "xxx" ->>> restricted to some host for debugging as I have a fairly big cluster
    ]
  ],
  "acceptedResourceRoles": [
    "*"
  ],
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "docker.xx.xx/flink:1.8.0",
      "network": "HOST",
      "portMappings": [],
      "privileged": false,
      "parameters": [],
      "forcePullImage": false
    }
  },
  "env": {
    "MESOS_MASTER": "zk://XX/mesos"
  },
  "portDefinitions": [
    {
      "port": 9081,
      "protocol": "tcp",
      "name": "default",
      "labels": {}
    }
  ],
  "uris": [
    "file:///etc/docker.tar.gz"
  ],
  "fetch": [
    {
      "uri": "file:///etc/docker.tar.gz",
      "extract": true,
      "executable": false,
      "cache": false
    }
  ]
}

On Tue, Aug 1, 2017 at 7:22 AM, Eron Wright <[hidden email]> wrote:
From the error message it seems that your Mesos cluster doesn't have the docker image provisioner installed.   The message originates from Mesos anyway so the problem lies there.   Note that docker image support is provided in Linux only.  You can also use the Flink on Mesos support without images, if you make sure that JAVA_HOME is set on all executors.

Hope this helps!

From: Biswajit Das
Sent: Tuesday, August 1, 1:24 AM
Subject: Re: Flink -mesos-app master hang
Hi Eron, I came across some of your comments in JIRA and wanted to clarify this ^^. I'm running a little clueless here; any pointers on where I should look?


-----------------------------------------------
2017-08-01 07:26:34,688 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator            - Waiting for more offers; 1 task(s) are not yet launched.
2017-08-01 07:26:34,717 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Launching Mesos task taskmanager-00039 on host 172.31.5.212.
2017-08-01 07:26:34,731 WARN  org.apache.flink.mesos.scheduler.TaskMonitor                  - Mesos task taskmanager-00039 failed unexpectedly.
2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Mesos task taskmanager-00039 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unsupported container image type: DOCKER)
2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Diagnostics for task taskmanager-00039 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Unsupported container image type: DOCKER
2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Total number of failed tasks so far: 3
2017-08-01 07:26:34,734 ERROR org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down cluster with status FAILED : Stopping Mesos session because the number of failed tasks (3) exceeded the maximum failed tasks (2). This number is controlled by the 'mesos.maximum-failed-tasks' configuration setting. By default its the number of requested tasks.
2017-08-01 07:26:34,734 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Shutting down and unregistering as a Mesos framework.
2017-08-01 07:26:34,745 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager  - Stopping Mesos resource master
2017-08-01 07:26:34,745 INFO  org.apache.f
---------------------------------------------------
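When the JobManager log is long, it can help to pull out just the failure reasons. The following Python sketch does that for lines shaped like the `MesosFlinkResourceManager` messages above; the regular expression is fitted to these sample lines only and is an assumption about the log format, not something taken from Flink itself.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... - Mesos task taskmanager-00039 failed, with a TaskManager in launch or
#   registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (...)
FAILURE = re.compile(
    r"Mesos task (?P<task>\S+) failed, .*?"
    r"State: (?P<state>\w+) Reason: (?P<reason>\w+) \((?P<message>.*)\)\s*$"
)


def summarize_failures(lines):
    """Count (reason, message) pairs across all matching log lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE.search(line)
        if match:
            counts[(match["reason"], match["message"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2017-08-01 07:26:34,731 WARN  org.apache.flink.mesos.scheduler.TaskMonitor"
        "                  - Mesos task taskmanager-00039 failed unexpectedly.",
        "2017-08-01 07:26:34,733 INFO  org.apache.flink.mesos.runtime.clusterframework."
        "MesosFlinkResourceManager  - Mesos task taskmanager-00039 failed, with a "
        "TaskManager in launch or registration. State: TASK_FAILED Reason: "
        "REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Unsupported "
        "container image type: DOCKER)",
    ]
    for (reason, message), count in summarize_failures(sample).items():
        print(f"{count}x {reason}: {message}")
```

A single dominant reason, as here, suggests a configuration problem on the agents rather than a flaky task.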

Thank you in advance.
~Biswajit

Thank you
~/Das