High virtual memory usage

High virtual memory usage

Paulo Cezar
Hi Folks,

I'm running Flink (1.2-SNAPSHOT nightly) on YARN (Hadoop 2.7.2). A few hours after I start a streaming job (built using the Kafka 0.10 connector for Scala 2.11), it gets killed, seemingly for no reason. After inspecting the logs, my best guess is that YARN is killing containers due to high virtual memory usage.

Any guesses on why this might be happening, or tips on what I should be looking for?

What I'll do next is enable taskmanager.debug.memory.startLogThread to keep investigating. Also, I was deploying flink-1.2-SNAPSHOT-bin-hadoop2.tgz on YARN, but my job uses Scala 2.11 dependencies, so I'll try flink-1.2-SNAPSHOT-bin-hadoop2_2.11.tgz instead.
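In case it helps anyone else, the switch goes into flink-conf.yaml; I believe the interval key below is its companion setting, so treat that part as an assumption:

# flink-conf.yaml: make every TaskManager periodically log its memory usage
# (heap, non-heap, direct memory and GC stats) so we can see what grows over time.
taskmanager.debug.memory.startLogThread: true
# Log interval in milliseconds (assumed companion setting; default is 5000).
taskmanager.debug.memory.logIntervalMs: 5000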

  • Flink logs:
2016-12-15 17:44:03,763 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@10.0.0.8:49832] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2016-12-15 17:44:05,475 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Container ResourceID{resourceId='container_1481732559439_0002_01_000004'} failed. Exit status: 1
2016-12-15 17:44:05,476 INFO  org.apache.flink.yarn.YarnFlinkResourceManager                - Diagnostics for container ResourceID{resourceId='container_1481732559439_0002_01_000004'} in state COMPLETE : exitStatus=1 diagnostics=Exception from container-launch.
Container id: container_1481732559439_0002_01_000004
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

  • YARN logs:
container_1481732559439_0002_01_000004: 2.6 GB of 5 GB physical memory used; 38.1 GB of 10.5 GB virtual memory used
2016-12-15 17:44:03,119 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 62223 for container-id container_1481732559439_0002_01_000001: 656.3 MB of 2 GB physical memory used; 3.2 GB of 4.2 GB virtual memory used
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1481732559439_0002_01_000004 is : 1
2016-12-15 17:44:03,766 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1481732559439_0002_01_000004 and exit code: 1
ExitCodeException exitCode=1: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
	at org.apache.hadoop.util.Shell.run(Shell.java:456)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Best regards,
Paulo Cezar
Re: High virtual memory usage

Stephan Ewen
Hi!

To diagnose this a little better, can you help us with the following info:

  - Are you using RocksDB?
  - What is your flink configuration, especially around memory settings?
  - What do you use for TaskManager heap size? Any manual value, or do you let Flink/Yarn set it automatically based on container size?
  - Do you use any libraries or connectors in your program?

Greetings,
Stephan


Re: High virtual memory usage

Stephan Ewen
Also, can you tell us what OS you are running on?

Re: High virtual memory usage

Paulo Cezar
  - Are you using RocksDB?
No.
 
  - What is your flink configuration, especially around memory settings? 
I'm using the default config with 2 GB for the JobManager and 5 GB for the TaskManagers. I'm starting Flink via "./bin/yarn-session.sh -d -n 5 -jm 2048 -tm 5120 -s 4 -nm 'Flink'".

  - What do you use for TaskManager heap size? Any manual value, or do you let Flink/Yarn set it automatically based on container size?
No manual values here. The YARN config is pretty much the default, with a maximum allocation of 12 GB of physical memory and a virtual-to-physical memory ratio of 2.1 (via yarn.nodemanager.vmem-pmem-ratio).
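For reference, the corresponding entries in yarn-site.xml (assuming the standard Hadoop 2.7 property names) are roughly:

<!-- yarn-site.xml -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>12288</value>
</property>
<property>
  <!-- Each container may use up to 2.1x its physical allocation as virtual
       memory before the NodeManager kills it (5 GB x 2.1 = 10.5 GB, which
       matches the limit shown in the log above). The check itself could be
       switched off via yarn.nodemanager.vmem-check-enabled, but that would
       only hide the symptom. -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>2.1</value>
</property>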
 
  - Do you use any libraries or connectors in your program?
I'm using flink-connector-kafka-0.10_2.11, a MongoDB client, a gRPC client, and some HTTP libraries like Unirest and Apache HttpClient.

  - Also, can you tell us what OS you are running on?
My YARN cluster runs in Docker containers (Docker 1.12) with images based on Ubuntu 14.04. The host OS is Ubuntu 14.04.4 LTS (GNU/Linux 3.19.0-65-generic x86_64).

Re: High virtual memory usage

Stephan Ewen
Hi Paulo!

Hmm, interesting. Such a high discrepancy between virtual and physical memory usually means that the process either maps large files into memory or pre-allocates a lot of memory without immediately using it.
Neither of these things is done by Flink.

Could this be an effect of either the Docker environment (mapping certain kernel spaces / libraries / whatever) or a result of one of the libraries (gRPC or so)?
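Independent of that, one way to see where the virtual memory actually goes is to dump the mappings of the TaskManager JVM on the affected node, for example:

# Find the TaskManager JVM inside the container (the main class name may differ per setup)
jps -l | grep -i taskmanager
# List its largest mappings, sorted by virtual size (the Kbytes column)
pmap -x <pid> | sort -n -k 2 | tail -n 20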

Stephan


Re: High virtual memory usage

Paulo Cezar
Hi Stephan, thanks for your support.

I was able to track down the problem a few days ago. Unirest was the one to blame: I was using it in some map functions to connect to external services, and for some reason it was using insane amounts of virtual memory.
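For anyone hitting the same thing, the shape of the code was roughly the sketch below (class name, URL and settings are placeholders, not the actual job code); keeping an eye on what Unirest's shared client holds on to, and shutting it down in close(), is what I'd watch out for:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.Unirest;

// Illustrative sketch only -- URL, names and settings are placeholders.
public class EnrichViaHttp extends RichMapFunction<String, String> {

    private static final String SERVICE_URL = "http://example.com/lookup";

    @Override
    public void open(Configuration parameters) {
        // Keep Unirest's shared HttpClient on a tight leash instead of its
        // fairly generous defaults.
        Unirest.setConcurrency(10, 2);
        Unirest.setTimeouts(5000, 10000);
    }

    @Override
    public String map(String value) throws Exception {
        // One blocking HTTP call per record to an external service.
        HttpResponse<String> response =
                Unirest.get(SERVICE_URL).queryString("q", value).asString();
        return response.getBody();
    }

    @Override
    public void close() throws Exception {
        // Release the shared client and its worker threads when the task
        // shuts down, so restarts don't leave old clients lying around.
        Unirest.shutdown();
    }
}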

Paulo Cezar

Re: High virtual memory usage

Stephan Ewen
Happy to hear that!


