Flink Yarn Session failures

Chan, Regina

Hi,

 

I am trying to understand why it takes about 9 minutes between the last attempt to start a container and the SIGTERM that finally kills the YarnApplicationMasterRunner.

 

Client:

 
Calc Engine: 2017-08-28 12:39:23,596 INFO  org.apache.flink.yarn.YarnClusterClient                       - Waiting until all TaskManagers have connected
Calc Engine: Waiting until all TaskManagers have connected
Calc Engine: 2017-08-28 12:39:23,600 INFO  org.apache.flink.yarn.YarnClusterClient                       - Starting client actor system.
Calc Engine: 2017-08-28 12:39:24,077 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
Calc Engine: 2017-08-28 12:39:24,366 INFO  Remoting                                                      - Remoting started; listening on addresses :[akka.tcp://[hidden email]:39353]
Calc Engine: 2017-08-28 12:39:24,609 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (0/4)
Calc Engine: TaskManager status (0/4)
Calc Engine: 2017-08-28 12:39:29,864 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (1/4)
Calc Engine: TaskManager status (1/4)
Calc Engine: 2017-08-28 12:39:30,389 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (2/4)
Calc Engine: TaskManager status (2/4)
Calc Engine: 2017-08-28 12:41:04,920 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (1/4)
Calc Engine: TaskManager status (1/4)
Calc Engine: 2017-08-28 12:41:13,775 INFO  org.apache.flink.yarn.YarnClusterClient                       - TaskManager status (0/4)
Calc Engine: TaskManager status (0/4)
Calc Engine: 2017-08-28 12:50:43,133 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://[hidden email]:58084] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

 

 

 

Logs:

 

Container id: container_e71_1503688027943_30786_01_000013
Exit code: 134
Stack trace: ExitCodeException exitCode=134: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
 
Shell output: main : command provided 1
main : user is delp
main : requested yarn user is delp
 
Container exited with a non-zero exit code 134
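
As a side note on the exit code above: by the usual shell convention, a status above 128 means the process was terminated by signal (status - 128), so 134 would read as signal 6 (SIGABRT). The sketch below only illustrates that arithmetic. The signal table assumes Linux numbering, and `decode_exit_code` is a hypothetical helper, not part of Hadoop or Flink. The YARN log itself does not say which signal was involved.

```python
# Assumption: Linux signal numbers and the common shell convention that an
# exit status above 128 means "terminated by signal (status - 128)".
LINUX_SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 15: "SIGTERM"}


def decode_exit_code(code):
    """Describe a process exit status (hypothetical helper for illustration)."""
    if code > 128:
        signum = code - 128
        name = LINUX_SIGNALS.get(signum, "unknown signal")
        return f"terminated by {name} (signal {signum})"
    return f"exited normally with status {code}"


# Exit code reported for container_e71_1503688027943_30786_01_000013.
print(decode_exit_code(134))  # terminated by SIGABRT (signal 6)
```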
 
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Total number of failed containers so far: 5
17/08/28 12:39:51 ERROR yarn.YarnFlinkResourceManager: Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Shutting down cluster with status FAILED : Stopping YARN session because the number of failed containers (5) exceeded the maximum failed containers (4). This number is controlled by the 'yarn.maximum-failed-containers' configuration setting. By default its the number of requested containers.
17/08/28 12:39:51 INFO yarn.YarnFlinkResourceManager: Unregistering application from the YARN Resource Manager
17/08/28 12:39:51 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2052)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:274)
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-010.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-016.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-013.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : d191303-019.dc.gs.com:45454
17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
17/08/28 12:39:52 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:21 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:31 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:41 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:40:51 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:48786] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:48786]] Caused by: [Connection refused: d191303-010.dc.gs.com/10.79.252.104:48786]
17/08/28 12:41:01 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:41:04 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://[hidden email]:48786]
17/08/28 12:41:04 INFO yarn.YarnJobManager: Task manager akka.tcp://[hidden email]:48786/user/taskmanager terminated.
17/08/28 12:41:04 INFO instance.InstanceManager: Unregistered task manager d191303-010.dc.gs.com/10.79.252.104. Number of registered task managers 1. Number of available slots 2.
17/08/28 12:41:11 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://[hidden email]:58367] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[hidden email]:58367]] Caused by: [Connection refused: d191303-016.dc.gs.com/10.79.162.181:58367]
17/08/28 12:41:13 WARN remote.RemoteWatcher: Detected unreachable: [akka.tcp://[hidden email]:58367]
17/08/28 12:41:13 INFO yarn.YarnJobManager: Task manager akka.tcp://[hidden email]:58367/user/taskmanager terminated.
17/08/28 12:41:13 INFO instance.InstanceManager: Unregistered task manager d191303-016.dc.gs.com/10.79.162.181. Number of registered task managers 0. Number of available slots 0.
17/08/28 12:50:42 INFO yarn.YarnApplicationMasterRunner: RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard root cache directory /tmp/flink-web-d1eebf19-098f-419e-859e-101cfd6c0749
17/08/28 12:50:42 INFO webmonitor.WebRuntimeMonitor: Removing web dashboard jar upload directory /tmp/flink-web-4d9bcf76-ddcb-4dbe-b91d-4a8d8da3d716
17/08/28 12:50:42 INFO blob.BlobServer: Stopped BLOB server at 0.0.0.0:35815
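
To make the "about 9 minutes" concrete, here is a minimal sketch that computes the gaps from three timestamps copied from the log above: the shutdown decision (12:39:51), the last task manager being unregistered (12:41:13), and the SIGTERM (12:50:42). The variable names and the `gap_seconds` helper are only illustrative.

```python
from datetime import datetime

# Matches the application master timestamps, e.g. "17/08/28 12:39:51".
FMT = "%y/%m/%d %H:%M:%S"


def gap_seconds(start, end):
    """Return the whole seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


shutdown_decided = "17/08/28 12:39:51"      # "Shutting down cluster with status FAILED"
last_tm_unregistered = "17/08/28 12:41:13"  # "Number of registered task managers 0"
sigterm_received = "17/08/28 12:50:42"      # "RECEIVED SIGNAL 15: SIGTERM"

for label, start in [("since last TM unregistered", last_tm_unregistered),
                     ("since shutdown decision", shutdown_decided)]:
    minutes, seconds = divmod(gap_seconds(start, sigterm_received), 60)
    print(f"{label}: {minutes} min {seconds} s")
```

The first gap is 9 min 29 s and the second is 10 min 51 s, so the quiet period in question runs from the last task manager disappearing until the SIGTERM arrives.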

 

 

 

 

Regina Chan

Goldman Sachs Enterprise Platforms, Data Architecture

30 Hudson Street, 37th floor | Jersey City, NJ 07302 | (212) 902-5697