Hi folks,

I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few hours after I start a streaming job (built using the Kafka 0.10 connector, flink-connector-kafka-0.10_2.11), it gets killed seemingly for no reason. After inspecting the logs, my best guess is that YARN is killing containers due to high virtual memory usage. Any guesses on why this might be happening, or tips on what I should be looking for?

What I'll do next is enable taskmanager.debug.memory.startLogThread to keep investigating. Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz on YARN, but my job uses Scala 2.11 dependencies, so I'll try using flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz instead.
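For reference, turning that on is a flink-conf.yaml change. This is a sketch: the interval key name is written from memory, so please double-check it against your Flink version's documentation.

```yaml
# flink-conf.yaml -- make each TaskManager periodically log its memory usage
taskmanager.debug.memory.startLogThread: true

# interval between log lines, in milliseconds (key name from memory -- verify)
taskmanager.debug.memory.logIntervalMs: 5000
```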
2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Container ResourceID{resourceId='container_1481732559439_0002_01_000004'} failed. Exit status: 1
2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager - Diagnostics for container ResourceID{resourceId='container_1481732559439_0002_01_000004'} in state COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
Container id: container_1481732559439_0002_01_000004
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
container_1481732559439_0002_01_000004: 2.6 GB of 5 GB physical memory used; 38.1 GB of 10.5 GB virtual memory used
2016-12-15 17:44:03,119 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 62223 for container-id container_1481732559439_0002_01_000001: 656.3 MB of 2 GB physical memory used; 3.2 GB of 4.2 GB virtual memory used
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1481732559439_0002_01_000004 is : 1
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1481732559439_0002_01_000004 and exit code: 1
ExitCodeException exitCode=1:
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Best regards,
Paulo Cezar
Hi!
To diagnose this a little better, can you help us with the following info:

- Are you using RocksDB?
- What is your Flink configuration, especially around memory settings?
- What do you use for TaskManager heap size? Any manual value, or do you let Flink/YARN set it automatically based on container size?
- Do you use any libraries or connectors in your program?

Greetings,
Stephan

On Fri, Dec 16, 2016 at 5:47 PM, Paulo Cezar <[hidden email]> wrote:
Also, can you tell us what OS you are running on? On Fri, Dec 16, 2016 at 6:23 PM, Stephan Ewen <[hidden email]> wrote:
No, I'm not using RocksDB.
I'm using the default config with 2 GB for the JobManager and 5 GB for TaskManagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5 -jm 2048 -tm 5120 -s 4 -nm 'Flink'".
No manual values here. My YARN config is pretty much default, with a maximum allocation of 12 GB of physical memory and a virtual-to-physical memory ratio of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).
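For anyone following along, those knobs correspond to these yarn-site.xml entries. The values mirror the setup described above (2.1 is in fact YARN's default ratio); the vmem check can also be switched off entirely if the virtual-memory footprint turns out to be harmless.

```xml
<!-- yarn-site.xml: settings relevant to this thread -->
<property>
  <!-- virtual memory allowed per unit of physical memory requested -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>
<property>
  <!-- set to false to stop the NodeManager from killing containers
       that exceed their virtual memory allowance -->
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>true</value>
</property>
```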
I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC client, and some HTTP libraries like Unirest and Apache HttpClient.
My YARN cluster runs on Docker containers (Docker version 1.12) with images based on Ubuntu 14.04. Host OS is Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-65-generic x86_64).
Hi Paulo!

Hmm, interesting. A high discrepancy between virtual and physical memory usually means that the process either maps large files into memory, or that it pre-allocates a lot of memory without immediately using it. Neither of these things is done by Flink.

Could this be an effect of either the Docker environment (mapping certain kernel spaces / libraries / whatever) or a result of one of the libraries (gRPC or so)?

Stephan

On Mon, Dec 19, 2016 at 12:32 PM, Paulo Cezar <[hidden email]> wrote:
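A quick way to check that from the node itself is to compare VmSize against VmRSS for the TaskManager's JVM. This is a sketch: the pid below is a placeholder (the shell's own pid); on a real node you would take the JVM pid from jps or ps.

```shell
# Placeholder pid: this shell itself. Substitute the TaskManager JVM pid.
PID=$$

# VmSize = total virtual memory reserved; VmRSS = physical memory actually resident.
grep -E '^(VmSize|VmRSS)' "/proc/$PID/status"

# Number of distinct memory mappings; reading /proc/$PID/maps (or `pmap -x $PID`)
# shows which mapped files or anonymous regions account for the reserved space.
wc -l < "/proc/$PID/maps"
```

Large anonymous regions here typically point at pre-allocation (thread stacks, malloc arenas), while large file-backed regions point at mmap'ed files or libraries.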
Hi Stephan, thanks for your support. I was able to track down the problem a few days ago. Unirest was the one to blame: I was using it in some map functions to connect to external services, and for some reason it was using insane amounts of virtual memory.

Paulo Cezar

On Mon, Dec 19, 2016 at 11:30 AM Stephan Ewen <[hidden email]> wrote:
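In case someone else hits a similar virtual-memory blow-up with thread-heavy HTTP clients: one possible contributor (an assumption on my part, not confirmed as the cause in this thread) is glibc's per-thread malloc arenas, each of which reserves a large chunk of virtual address space. Capping them in the container environment before the JVM starts is a common mitigation:

```shell
# Assumption / hypothetical mitigation: limit glibc malloc arenas. On 64-bit,
# each arena reserves roughly 64 MB of virtual address space, and thread pools
# (such as those kept by HTTP client libraries) can multiply the arena count.
export MALLOC_ARENA_MAX=4

# Any JVM launched from this environment inherits the cap.
echo "MALLOC_ARENA_MAX is set to $MALLOC_ARENA_MAX"
```

Hadoop applies a similar cap to its own daemons for exactly this reason, so it is a low-risk thing to try when vmem usage looks inflated relative to RSS.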
Happy to hear that!
On Thu, Jan 5, 2017 at 1:34 PM, Paulo Cezar <[hidden email]> wrote: