GC on taskmanagers

Emmanuel
My Java is still rusty and I often run into OutOfMemoryError: GC overhead limit exceeded...

Yes, I need to look for memory leaks...

But first I need to clear up this memory so I can run again without having to shut down and restart everything.

I've tried using the jcmd <pid> GC.run command on each of the JVM instances on a taskmanager, but I get a boatload of output like this:

On the host running the command:
com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not responding or HotSpot VM not loaded
at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:213)
at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:140)
at sun.tools.jcmd.JCmd.main(JCmd.java:129)



and on the taskmanager log:

"Flink-IPC Server handler 1 on 6121" daemon prio=10 tid=0x00007f5f107ee000 nid=0x8f waiting on condition [0x00007f5eb4803000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000000f37e95c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.flink.runtime.ipc.Server$Handler.run(Server.java:941)

"Flink-IPC Server handler 0 on 6121" daemon prio=10 tid=0x00007f5f107eb800 nid=0x8e waiting on condition [0x00007f5eb4904000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000000f37e95c0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.flink.runtime.ipc.Server$Handler.run(Server.java:941)

"Flink-IPC Server listener on 6121" daemon prio=10 tid=0x00007f5f107e9800 nid=0x8d runnable [0x00007f5eb4a05000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000f385d3c0> (a sun.nio.ch.Util$2)
- locked <0x00000000f385d3d0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000f385d378> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
at org.apache.flink.runtime.ipc.Server$Listener.run(Server.java:341)

"Flink-IPC Server Responder" daemon prio=10 tid=0x00007f5f107e8800 nid=0x8c runnable [0x00007f5eb4b06000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000f387b528> (a sun.nio.ch.Util$2)
- locked <0x00000000f387b538> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000f387b4e0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.apache.flink.runtime.ipc.Server$Responder.run(Server.java:506)

"Service Thread" daemon prio=10 tid=0x00007f5f100c2000 nid=0x8a runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f5f100c0000 nid=0x89 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f5f100bd000 nid=0x88 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f5f100b3000 nid=0x87 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f5f1009c800 nid=0x86 in Object.wait() [0x00007f5eb605b000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000f381cc08> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x00000000f381cc08> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

"Reference Handler" daemon prio=10 tid=0x00007f5f10098800 nid=0x85 in Object.wait() [0x00007f5eb615c000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000f381c820> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x00000000f381c820> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f5f1000d800 nid=0x6a in Object.wait() [0x00007f5f178d4000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000fbe14200> (a java.lang.Object)
at java.lang.Object.wait(Object.java:503)
at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.java:1115)
- locked <0x00000000fbe14200> (a java.lang.Object)

"VM Thread" prio=10 tid=0x00007f5f10096000 nid=0x84 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f5f10023000 nid=0x6b runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f5f10025000 nid=0x6c runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f5f10027000 nid=0x6d runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f5f10029000 nid=0x6e runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007f5f1002a800 nid=0x6f runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007f5f1002c800 nid=0x70 runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007f5f1002e800 nid=0x71 runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007f5f10030000 nid=0x72 runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x00007f5f10032000 nid=0x73 runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x00007f5f10034000 nid=0x74 runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x00007f5f10036000 nid=0x75 runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x00007f5f10037800 nid=0x76 runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x00007f5f10039800 nid=0x77 runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x00007f5f1003b800 nid=0x78 runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x00007f5f1003d000 nid=0x79 runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x00007f5f1003f000 nid=0x7a runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x00007f5f10041000 nid=0x7b runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x00007f5f10043000 nid=0x7c runnable

"GC task thread#18 (ParallelGC)" prio=10 tid=0x00007f5f10044800 nid=0x7d runnable

"GC task thread#19 (ParallelGC)" prio=10 tid=0x00007f5f10046800 nid=0x7e runnable

"GC task thread#20 (ParallelGC)" prio=10 tid=0x00007f5f10048800 nid=0x7f runnable

"GC task thread#21 (ParallelGC)" prio=10 tid=0x00007f5f1004a000 nid=0x80 runnable

"GC task thread#22 (ParallelGC)" prio=10 tid=0x00007f5f1004c000 nid=0x81 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007f5f100d5000 nid=0x8b waiting on condition

JNI global references: 530

Heap
 PSYoungGen      total 76800K, used 63133K [0x00000000faa80000, 0x0000000100000000, 0x0000000100000000)
  eden space 66048K, 95% used [0x00000000faa80000,0x00000000fe827690,0x00000000feb00000)
  from space 10752K, 0% used [0x00000000ff580000,0x00000000ff580000,0x0000000100000000)
  to   space 10752K, 0% used [0x00000000feb00000,0x00000000feb00000,0x00000000ff580000)
 ParOldGen       total 175104K, used 175046K [0x00000000eff80000, 0x00000000faa80000, 0x00000000faa80000)
  object space 175104K, 99% used [0x00000000eff80000,0x00000000faa71bb0,0x00000000faa80000)
 PSPermGen       total 29696K, used 29267K [0x00000000dff80000, 0x00000000e1c80000, 0x00000000eff80000)
  object space 29696K, 98% used [0x00000000dff80000,0x00000000e1c14d38,0x00000000e1c80000)





Any insight on how to free the memory cleanly when this happens?

Thanks!


Re: GC on taskmanagers

Maximilian Michels
Hi Emmanuel,

In Java, the garbage collector runs automatically whenever memory gets tight, and it can only reclaim objects that are no longer reachable. So triggering it remotely won't make any difference.
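
To illustrate why a forced collection does not help while the data is still referenced, here is a minimal sketch. The class and method names are made up for this example and are not part of Flink. It uses a WeakReference as a probe: as long as a strong reference exists, a requested GC cannot reclaim the object.

```java
import java.lang.ref.WeakReference;

public class ReachabilityDemo {

    /** Requests a collection and reports whether the referent is still alive. */
    static boolean survivesGc(WeakReference<byte[]> ref) {
        System.gc(); // only a request, much like GC.run; the JVM decides what can be freed
        return ref.get() != null;
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] block = new byte[16 * 1024 * 1024]; // strong reference: reachable
        WeakReference<byte[]> probe = new WeakReference<>(block);

        // Still strongly reachable, so no collection is allowed to reclaim it.
        boolean alive = survivesGc(probe);
        System.out.println("reachable (" + block.length + " bytes), survives GC: " + alive);

        block = null; // drop the only strong reference

        // Collection is usually prompt at this point but not guaranteed,
        // so retry a few times before giving up.
        boolean survived = true;
        for (int i = 0; i < 10 && survived; i++) {
            survived = survivesGc(probe);
            if (survived) {
                Thread.sleep(50);
            }
        }
        System.out.println("unreachable, survives GC: " + survived);
    }
}
```

If memory stays full after a job is cancelled, the likely reading is that something still holds references to the job's objects, which a manual GC cannot fix.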

If you want to reuse the existing Java process without restarting it, you have to stop the code that is causing the OutOfMemoryError from executing. Usually, this is quite tricky because your program might not even accept input any more when it is constantly occupied with garbage collection.

Where was the OutOfMemoryError thrown? Do you have the stack trace of the error? From the task manager stack trace, it actually looks like your program is not executing any more. I would try executing a demo program (e.g. WordCount) to check your setup.

Best regards,
Max

On Tue, Mar 31, 2015 at 5:44 AM, Emmanuel <[hidden email]> wrote:
[...]


RE: GC on taskmanagers

Emmanuel
Max,

Thanks for the answer...

What I am saying is that my program is indeed not running, yet garbage collection doesn't seem to occur after cancelling the job. As you saw in the log, the memory is still 99% used even though I cancelled the job, and I cannot seem to run another job.
I've had to kill the task manager and restart.
Maybe I'm not clear about how things work, but I thought cancelling a job would just remove the program from memory and clear the memory that hadn't been garbage collected, but that doesn't seem to happen, unless I need to wait a while.
Once my code has generated the OutOfMemoryError, I can't seem to run another job.
So the question is: what am I supposed to do to clear the memory after a program has failed with this OutOfMemoryError?

Thanks




From: [hidden email]
Date: Tue, 31 Mar 2015 11:29:30 +0200
Subject: Re: GC on taskmanagers
To: [hidden email]

[...]

Re: GC on taskmanagers

Maximilian Michels
Hi Emmanuel,

If a job fails due to an Exception being thrown, the job is canceled. The task manager remains intact and further jobs can be submitted.

An OutOfMemoryError, as the name implies, is not an Exception but an Error. In general, it is thrown when you create objects excessively or when the task manager's memory is simply too small. Due to the nature of an out-of-memory situation, it can be thrown at an arbitrary location in the code. Your job might be the cause of the OutOfMemoryError, but it can just as well be thrown in Flink's internals (e.g. when a new object is allocated). It is pretty much impossible to catch this error everywhere, as that would require every code block to catch it. And even if we successfully caught an OutOfMemoryError, it is quite unclear how to recover in such a situation. Unfortunately, that usually means that the task manager needs to be restarted.
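
The Exception/Error distinction can be seen in a few lines of plain Java. This is only a sketch with a hypothetical class name, and the OutOfMemoryError is constructed by hand rather than provoked, so nothing here actually exhausts the heap. It shows that a handler written for Exception does not see an OutOfMemoryError.

```java
public class ErrorVsException {

    /** Reports which catch clause ends up handling the given throwable. */
    static String handledBy(Throwable problem) {
        try {
            throw problem;
        } catch (Exception e) {
            // Ordinary failures, e.g. a bug in user code, land here.
            return "catch (Exception)";
        } catch (Error e) {
            // OutOfMemoryError extends VirtualMachineError, which extends Error.
            return "catch (Error)";
        } catch (Throwable t) {
            // Any other direct Throwable subclass.
            return "catch (Throwable)";
        }
    }

    public static void main(String[] args) {
        System.out.println(handledBy(new IllegalStateException("simulated job failure")));
        System.out.println(handledBy(new OutOfMemoryError("simulated")));
    }
}
```

Catching Error (or Throwable) everywhere is possible in principle, but as noted above, the process may be in a state where there is no sensible way to continue.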

Are you using the Streaming API of Flink? Currently, we don't use the managed memory there, which makes OutOfMemoryErrors more likely to occur. Do you have a code excerpt you might want to share with us? We could try to find the cause of the error there. Otherwise, you could try to increase the task manager's memory if you have more available on your machines.
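
Whether a larger memory setting actually took effect, and how full each pool is, can be checked from inside the JVM with the standard java.lang.management API. The following is a minimal sketch, not Flink code: the class and method names are invented for illustration, and it assumes you can run it in the JVM you care about (for example from a small test job).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class HeapPoolReport {

    /** Builds one line per memory pool: name, used KB, limit KB, percent used. */
    static List<String> describePools() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when no maximum is defined; fall back to committed.
            long limit = usage.getMax() > 0 ? usage.getMax() : usage.getCommitted();
            long percent = limit > 0 ? (100 * usage.getUsed()) / limit : 0;
            lines.add(String.format("%-30s used %8dK of %8dK (%d%%)",
                    pool.getName(), usage.getUsed() / 1024, limit / 1024, percent));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Overall heap limit, as set by -Xmx or the JVM default.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap: " + maxHeapMb + " MB");
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```

The pool names depend on the collector in use. With the ParallelGC setup shown in the first post, they should correspond to the PSYoungGen spaces, ParOldGen and PSPermGen sections of the heap summary.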


Best regards,
Max

On Tue, Mar 31, 2015 at 8:08 PM, Emmanuel <[hidden email]> wrote:
[...]
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000f385d3c0> (a sun.nio.ch.Util$2)
- locked <0x00000000f385d3d0> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000f385d378> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
at org.apache.flink.runtime.ipc.Server$Listener.run(Server.java:341)

"Flink-IPC Server Responder" daemon prio=10 tid=0x00007f5f107e8800 nid=0x8c runnable [0x00007f5eb4b06000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
- locked <0x00000000f387b528> (a sun.nio.ch.Util$2)
- locked <0x00000000f387b538> (a java.util.Collections$UnmodifiableSet)
- locked <0x00000000f387b4e0> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
at org.apache.flink.runtime.ipc.Server$Responder.run(Server.java:506)

"Service Thread" daemon prio=10 tid=0x00007f5f100c2000 nid=0x8a runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f5f100c0000 nid=0x89 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f5f100bd000 nid=0x88 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f5f100b3000 nid=0x87 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f5f1009c800 nid=0x86 in Object.wait() [0x00007f5eb605b000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000f381cc08> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
- locked <0x00000000f381cc08> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

"Reference Handler" daemon prio=10 tid=0x00007f5f10098800 nid=0x85 in Object.wait() [0x00007f5eb615c000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000f381c820> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:503)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
- locked <0x00000000f381c820> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f5f1000d800 nid=0x6a in Object.wait() [0x00007f5f178d4000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000fbe14200> (a java.lang.Object)
at java.lang.Object.wait(Object.java:503)
at org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.java:1115)
- locked <0x00000000fbe14200> (a java.lang.Object)

"VM Thread" prio=10 tid=0x00007f5f10096000 nid=0x84 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f5f10023000 nid=0x6b runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f5f10025000 nid=0x6c runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f5f10027000 nid=0x6d runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f5f10029000 nid=0x6e runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x00007f5f1002a800 nid=0x6f runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x00007f5f1002c800 nid=0x70 runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x00007f5f1002e800 nid=0x71 runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x00007f5f10030000 nid=0x72 runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x00007f5f10032000 nid=0x73 runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x00007f5f10034000 nid=0x74 runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x00007f5f10036000 nid=0x75 runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x00007f5f10037800 nid=0x76 runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x00007f5f10039800 nid=0x77 runnable

"GC task thread#13 (ParallelGC)" prio=10 tid=0x00007f5f1003b800 nid=0x78 runnable

"GC task thread#14 (ParallelGC)" prio=10 tid=0x00007f5f1003d000 nid=0x79 runnable

"GC task thread#15 (ParallelGC)" prio=10 tid=0x00007f5f1003f000 nid=0x7a runnable

"GC task thread#16 (ParallelGC)" prio=10 tid=0x00007f5f10041000 nid=0x7b runnable

"GC task thread#17 (ParallelGC)" prio=10 tid=0x00007f5f10043000 nid=0x7c runnable

"GC task thread#18 (ParallelGC)" prio=10 tid=0x00007f5f10044800 nid=0x7d runnable

"GC task thread#19 (ParallelGC)" prio=10 tid=0x00007f5f10046800 nid=0x7e runnable

"GC task thread#20 (ParallelGC)" prio=10 tid=0x00007f5f10048800 nid=0x7f runnable

"GC task thread#21 (ParallelGC)" prio=10 tid=0x00007f5f1004a000 nid=0x80 runnable

"GC task thread#22 (ParallelGC)" prio=10 tid=0x00007f5f1004c000 nid=0x81 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007f5f100d5000 nid=0x8b waiting on condition

JNI global references: 530

Heap
 PSYoungGen      total 76800K, used 63133K [0x00000000faa80000, 0x0000000100000000, 0x0000000100000000)
  eden space 66048K, 95% used [0x00000000faa80000,0x00000000fe827690,0x00000000feb00000)
  from space 10752K, 0% used [0x00000000ff580000,0x00000000ff580000,0x0000000100000000)
  to   space 10752K, 0% used [0x00000000feb00000,0x00000000feb00000,0x00000000ff580000)
 ParOldGen       total 175104K, used 175046K [0x00000000eff80000, 0x00000000faa80000, 0x00000000faa80000)
  object space 175104K, 99% used [0x00000000eff80000,0x00000000faa71bb0,0x00000000faa80000)
 PSPermGen       total 29696K, used 29267K [0x00000000dff80000, 0x00000000e1c80000, 0x00000000eff80000)
  object space 29696K, 98% used [0x00000000dff80000,0x00000000e1c14d38,0x00000000e1c80000)





Any insight on clearing GC cleanly when this happens?

Thanks!