I am trying to configure Flink to run on top of Mesos. I am using Flink release-1.3 and the Mesos that underlies DC/OS 1.9, which is version 1.2. Flink starts without any issues when the TaskManager starts on the same host as the appmaster. But when the TaskManager is launched on a different host, the container fails to launch. The Flink mesos-appmaster log looks as follows:

2017-06-08 19:19:01,537 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00003 on host 10.101.2.117.
2017-06-08 19:19:01,550 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00002 on host 10.101.2.117.
2017-06-08 19:19:01,607 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00001 on host 10.101.2.117.
2017-06-08 19:19:01,623 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00004 on host 10.101.2.117.
2017-06-08 19:19:01,645 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00006 on host 10.101.2.91.
2017-06-08 19:19:01,660 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00005 on host 10.101.2.91.
2017-06-08 19:19:01,674 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Launching Mesos task taskmanager-00007 on host 10.101.2.91.
2017-06-08 19:19:02,234 WARN org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00003 failed unexpectedly.
2017-06-08 19:19:02,234 WARN org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00002 failed unexpectedly.
2017-06-08 19:19:02,245 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Mesos task taskmanager-00002 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Failed to fetch all URIs for container '125055b6-9a19-4d62-a019-5d8a4197c043' with exit status: 256)
2017-06-08 19:19:02,246 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Diagnostics for task taskmanager-00002 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Failed to fetch all URIs for container '125055b6-9a19-4d62-a019-5d8a4197c043' with exit status: 256
2017-06-08 19:19:02,247 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Total number of failed tasks so far: 1
2017-06-08 19:19:02,252 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Mesos task taskmanager-00003 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Failed to fetch all URIs for container '69259a92-b3e4-44c7-9afd-3ac650524570' with exit status: 256)
2017-06-08 19:19:02,252 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Diagnostics for task taskmanager-00003 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Failed to fetch all URIs for container '69259a92-b3e4-44c7-9afd-3ac650524570' with exit status: 256
2017-06-08 19:19:02,252 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Total number of failed tasks so far: 2
2017-06-08 19:19:02,313 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Scheduling Mesos task taskmanager-00008 with (2048.0 MB, 1.0 cpus).
2017-06-08 19:19:02,330 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Scheduling Mesos task taskmanager-00009 with (2048.0 MB, 1.0 cpus).
2017-06-08 19:19:02,331 INFO org.apache.flink.mesos.scheduler.LaunchCoordinator - Now gathering offers for at least 2 task(s).
2017-06-08 19:19:02,332 WARN org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00004 failed unexpectedly.
2017-06-08 19:19:02,332 WARN org.apache.flink.mesos.scheduler.TaskMonitor - Mesos task taskmanager-00001 failed unexpectedly.
2017-06-08 19:19:02,412 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Mesos task taskmanager-00004 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Failed to fetch all URIs for container 'a65c3e35-579d-4302-830f-be50b6d0ca06' with exit status: 256)
2017-06-08 19:19:02,412 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Diagnostics for task taskmanager-00004 in state TASK_FAILED : reason=REASON_CONTAINER_LAUNCH_FAILED message=Failed to launch container: Failed to fetch all URIs for container 'a65c3e35-579d-4302-830f-be50b6d0ca06' with exit status: 256
2017-06-08 19:19:02,412 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Total number of failed tasks so far: 3
2017-06-08 19:19:02,432 INFO org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager - Mesos task taskmanager-00001 failed, with a TaskManager in launch or registration. State: TASK_FAILED Reason: REASON_CONTAINER_LAUNCH_FAILED (Failed to launch container: Failed to fetch all URIs for container '325e14fe-8840-4996-96dc-5c7ffc159d12' with exit status: 256)

I checked the stderr in Mesos sandbox and it is as follows:

I0608 19:20:06.184386 30480 fetcher.cpp:531] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/6b7667c0-1b1a-43a4-ba1f-27cb0660608f-S6\/flink","items":[{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/mesos-taskmanager.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/mesos-taskmanager.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/yarn-session.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/yarn-session.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j-console.properties","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/log4j-console.properties"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j.properties","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/log4j.properties"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/log4j-1.2.17.jar","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/lib\/log4j-1.2.17.jar"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/mesos-appmaster.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/mesos-appmaster.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/stop-zookeeper-quorum.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7
-4d53-8518-53c1b3e7ef78\/flink\/bin\/stop-zookeeper-quorum.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/stop-local.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/stop-local.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/taskmanager.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/taskmanager.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-local.bat","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/start-local.bat"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-cluster.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/start-cluster.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/stop-cluster.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/stop-cluster.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-scala-shell.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/start-scala-shell.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/flink","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/flink"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/pyflink.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/pyflink.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j-yarn-session.properties","val
ue":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/log4j-yarn-session.properties"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/logback-yarn.xml","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/logback-yarn.xml"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/flink-daemon.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/flink-daemon.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/zookeeper.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/zookeeper.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/logback-console.xml","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/logback-console.xml"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/masters","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/masters"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/conf\/flink-conf.yaml","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/flink-conf.yaml"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/zoo.cfg","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/zoo.cfg"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/flink-shaded-hadoop2-uber-1.3-SNAPSHOT.jar","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/lib\/flink-shaded-hadoop2-uber-1.3-SNAPSHOT.jar"}},{"action":"BYPASS_CACHE","ur
i":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/slaves","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/slaves"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/flink-dist_2.10-1.3-SNAPSHOT.jar","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/lib\/flink-dist_2.10-1.3-SNAPSHOT.jar"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/slf4j-log4j12-1.7.7.jar","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/lib\/slf4j-log4j12-1.7.7.jar"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/log4j-cli.properties","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/log4j-cli.properties"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/historyserver.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/historyserver.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/lib\/flink-python_2.10-1.3-SNAPSHOT.jar","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/lib\/flink-python_2.10-1.3-SNAPSHOT.jar"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":false,"extract":false,"output_file":"flink\/conf\/logback.xml","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/conf\/logback.xml"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/pyflink.bat","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/pyflink.bat"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-local.sh","value":"htt
p:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/start-local.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/flink.bat","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/flink.bat"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/start-zookeeper-quorum.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/start-zookeeper-quorum.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/jobmanager.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/jobmanager.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/flink-console.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/flink-console.sh"}},{"action":"BYPASS_CACHE","uri":{"cache":true,"executable":true,"extract":false,"output_file":"flink\/bin\/config.sh","value":"http:\/\/localhost:38985\/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78\/flink\/bin\/config.sh"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/6b7667c0-1b1a-43a4-ba1f-27cb0660608f-S6\/frameworks\/6b7667c0-1b1a-43a4-ba1f-27cb0660608f-0030\/executors\/taskmanager-00009\/runs\/d8d1756d-f977-43f6-a53f-55c19b6c6294","user":"flink"}
I0608 19:20:06.189909 30480 fetcher.cpp:442] Fetching URI 'http://localhost:38985/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78/flink/bin/mesos-taskmanager.sh'
I0608 19:20:06.189932 30480 fetcher.cpp:283] Fetching directly into the sandbox directory
I0608 19:20:06.190213 30480 fetcher.cpp:220] Fetching URI 'http://localhost:38985/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78/flink/bin/mesos-taskmanager.sh'
I0608 19:20:06.190251 30480 fetcher.cpp:163] Downloading resource from 'http://localhost:38985/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78/flink/bin/mesos-taskmanager.sh' to '/var/lib/mesos/slave/slaves/6b7667c0-1b1a-43a4-ba1f-27cb0660608f-S6/frameworks/6b7667c0-1b1a-43a4-ba1f-27cb0660608f-0030/executors/taskmanager-00009/runs/d8d1756d-f977-43f6-a53f-55c19b6c6294/flink/bin/mesos-taskmanager.sh'
Failed to fetch 'http://localhost:38985/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78/flink/bin/mesos-taskmanager.sh': Error downloading resource: Couldn't connect to server
Failed to synchronize with agent (it's probably exited)

So my question is: what am I missing? Do I need to specify a special URI in Marathon for Flink? I am setting mesos.master to zk://leader.mesos:2181/mesos. Is this what is causing the problem? Or have I missed some Mesos or Marathon setting? Also, I am launching this via Marathon, and I have the same Flink distribution at the same path on all the slaves.

Thanks,

Hi Ani,

the problem is that you have to set a reachable jobmanager hostname in the

Cheers,

On Thu, Jun 8, 2017 at 11:21 PM, ani.desh1512 <[hidden email]> wrote:
I am trying to configure Flink to work on top of Mesos. I am using Flink
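
To make the diagnosis concrete: every fetch URI in the sandbox stderr points at http://localhost:38985, which a Mesos agent on another host cannot reach, and that matches the "Couldn't connect to server" error. Below is a minimal Python sketch that flags such loopback URIs in a list of fetch URIs. The helper names (is_loopback_host, unreachable_fetch_uris) are hypothetical, and the second sample URI is an assumed example of a reachable address, not something taken from the logs.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_host(host):
    """Return True if host is 'localhost' or a loopback IP literal."""
    if host is None:
        return False
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A regular hostname; this sketch does not resolve it via DNS.
        return False


def unreachable_fetch_uris(uris):
    """Keep only the URIs whose host is a loopback address."""
    return [u for u in uris if is_loopback_host(urlparse(u).hostname)]


uris = [
    # Taken from the fetcher log above.
    "http://localhost:38985/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78/flink/bin/mesos-taskmanager.sh",
    # Hypothetical: the same artifact advertised under a routable address.
    "http://10.101.2.117:38985/567dfcb8-f7d7-4d53-8518-53c1b3e7ef78/flink/bin/mesos-taskmanager.sh",
]

for uri in unreachable_fetch_uris(uris):
    print("loopback URI, only fetchable on the appmaster host:", uri)
```

Only the first URI is reported. This is also consistent with the TaskManager starting fine when it lands on the same host as the appmaster, since localhost resolves to the right machine there.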