Hi,
I am trying to understand why about 9 minutes pass between the last attempt to start a container and the moment the YarnApplicationMasterRunner finally receives the SIGTERM that kills it.
Client:
Calc Engine: 2017-08-28 12:39:23,596 INFO org.apache.flink.yarn.YarnClusterClient - Waiting until all TaskManagers have connected
Calc Engine: Waiting until all TaskManagers have connected
Calc Engine: 2017-08-28 12:39:23,600 INFO org.apache.flink.yarn.YarnClusterClient - Starting client actor system.
Calc Engine: 2017-08-28 12:39:24,077 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
Calc Engine: 2017-08-28 12:39:24,366 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://[hidden email]:39353]
Calc Engine: 2017-08-28 12:39:24,609 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (0/4)
Calc Engine: TaskManager status (0/4)
Calc Engine: 2017-08-28 12:39:29,864 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (1/4)
Calc Engine: TaskManager status (1/4)
Calc Engine: 2017-08-28 12:39:30,389 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (2/4)
Calc Engine: TaskManager status (2/4)
Calc Engine: 2017-08-28 12:41:04,920 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (1/4)
Calc Engine: TaskManager status (1/4)
Calc Engine: 2017-08-28 12:41:13,775 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (0/4)
Calc Engine: TaskManager status (0/4)
Calc Engine: 2017-08-28 12:50:43,133 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[hidden email]:58084] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
Logs:
Container id: container_e71_1503688027943_30786_01_000013
Exit code: 134
Stack trace: ExitCodeException exitCode=134:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Shell output: main : command provided 1
main : user is delp
main : requested yarn user is delp
Container exited with a non-zero exit code 134
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Total number of failed containers so far: 5
17/08/28 12:39:51 ERROR yarn.YarnFlinkResourceManager: Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Shutting down cluster with status FAILED : Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Unregistering application from the YARN Resource Manager
17/08/28 12:39:51 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-016.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-013.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:41:04 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://[hidden email]:48786]
17/08/28 12:41:04 INFO yarn.YarnJobManager: Task manager akka.tcp://[hidden email]:48786/user/taskmanager terminated.
17/08/28 12:41:04 INFO instance.InstanceManager: Unregistered task manager d191303-010.dc.gs.com/10.79.252.104. Number of registered task managers 1. Number of available slots 2.
17/08/28 12:41:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:41:13 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://[hidden email]:58367]
17/08/28 12:41:13 INFO yarn.YarnJobManager: Task manager akka.tcp://[hidden email]:58367/user/taskmanager terminated.
17/08/28 12:41:13 INFO instance.InstanceManager: Unregistered task manager d191303-016.dc.gs.com/10.79.162.181. Number of registered task managers 0. Number of available slots 0.
17/08/28 12:50:42 INFO yarn.YarnApplicationMasterRunner: RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard root cache directory /tmp/flink-web-d1eebf19-098f-419e-859e-101cfd6c0749
17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard jar upload directory /tmp/flink-web-4d9bcf76-ddcb-4dbe-b91d-4a8d8da3d716
17/08/28 12:50:42 INFO blob.BlobServer: Stopped BLOB server at 0.0.0.0:35815
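To put a number on the gap, here is a small sketch that subtracts the timestamps copied from the ApplicationMaster log above. The helper name `gap_seconds` and the variable names are made up for illustration; the three timestamps are the shutdown decision, the last TaskManager being unregistered, and the SIGTERM.

```python
from datetime import datetime

# Timestamp layout used by the ApplicationMaster log lines (yy/MM/dd HH:mm:ss).
FMT = "%y/%m/%d %H:%M:%S"


def gap_seconds(start: str, end: str) -> int:
    """Return the number of whole seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


# Values copied from the log above; the variable names are illustrative only.
shutdown_decided = "17/08/28 12:39:51"      # "Shutting down cluster with status FAILED"
last_tm_unregistered = "17/08/28 12:41:13"  # "Number of registered task managers 0"
sigterm_received = "17/08/28 12:50:42"      # "RECEIVED SIGNAL 15: SIGTERM"

print(gap_seconds(last_tm_unregistered, sigterm_received))  # 569 s, i.e. 9 min 29 s
print(gap_seconds(shutdown_decided, sigterm_received))      # 651 s, i.e. 10 min 51 s
```

So the ApplicationMaster stays up for roughly 9.5 minutes after the last TaskManager is gone, and close to 11 minutes after it decided to shut the cluster down.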
Regina Chan
Goldman Sachs, Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302
(212) 902-5697