I'm having an issue where off-heap memory grows unchecked until I get OOM exceptions.

I was hoping that upgrading to 1.4 would solve this, since the child-first classloader is supposed to resolve issues with Avro classes cached in a different classloader (which prevents the classloaders from being garbage collected).

However, after upgrading, we are still seeing an off-heap memory leak. I think I may have isolated the issue to the JmxReporter class used for collecting Kafka metrics.

Here are the details of what I'm seeing:
Our cluster is running in Kubernetes, using the latest flink:1.4 docker image. We are using the default classloading order (child first).

If I resubmit my job repeatedly, the ClassLoaders from the previous job submissions don't get cleaned up, and the non-heap memory slowly grows until the task manager runs out of memory.

I can see all of the un-collected classloaders if I run "sudo -u flink jmap -clstats <proc_id>" (the output is below). This list of dead classloaders continues to grow every time I kill and resubmit a Flink job. In all, it lists 3200 dead class loaders. I'm only including the largest ones below.

finding class loader instances ..done.
computing per loader stat ..done.
please wait.. computing liveness.
liveness analysis may be inaccurate ...

class_loader        classes  bytes     parent_loader       alive?  type
0x00000000807302a0  7522     12213076  0x00000000804c58c0  dead    sun/misc/Launcher$AppClassLoader@0x000000010000f070
0x000000008eb00000  3699     6021535   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000094200190  3693     6016807   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x000000009e7bc6c8  3696     6001081   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000a9d80008  3584     5530412   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000f4103650  3581     5527354   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000901801f8  3581     5527354   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000942637c0  3231     5121176   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000096c2ec00  3231     5119662   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x000000008f600000  3225     5116241   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000092700d48  3228     5112270   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
<bootstrap>         2548     4424440   null                live    <internal>
0x0000000096b77190  2234     3634602   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98

Next I took a heap dump:

sudo -u flink jmap -dump:format=b,file=/tmp/HeapDump.hprof <procpid>

Then, using Eclipse Memory Analyzer, I followed the steps from this blog post:
http://java.jiderhamn.se/2011/12/11/classloader-leaks-i-how-to-find-classloader-leaks-with-eclipse-memory-analyser-mat/

The result of looking for strong references to classes in a dead classloader is this tree:

Class Name                                                                                                    | Shallow Heap | Retained Heap
---------------------------------------------------------------------------------------------------------------------------------------------
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader @ 0x94200190  |           88 |       616,992
'- <classloader> class org.apache.kafka.common.metrics.JmxReporter$KafkaMbean @ 0x94250cb0                    |            0 |             0
   '- <class> org.apache.kafka.common.metrics.JmxReporter$KafkaMbean @ 0xbae537e8                             |           24 |           328
      '- object com.sun.jmx.mbeanserver.NamedObject @ 0xbace01e0                                              |           24 |            24
         '- value java.util.HashMap$Node @ 0xbace0110                                                         |           32 |           232
            '- [247] java.util.HashMap$Node[512] @ 0xfa0d08c0                                                 |        2,064 |       120,104
               '- table java.util.HashMap @ 0x806e9f08                                                        |           48 |       120,152
                  '- value java.util.HashMap$Node @ 0x806e9ee8                                                |           32 |       120,184
                     '- [8] java.util.HashMap$Node[16] @ 0x80502da0                                           |           80 |       134,944
                        '- table java.util.HashMap @ 0x80502d70                                               |           48 |       134,992
                           '- domainTb com.sun.jmx.mbeanserver.Repository @ 0x80502d50                        |           32 |       135,200
---------------------------------------------------------------------------------------------------------------------------------------------

This suggests to me that the JMX registry is created by the bootstrap or app classloader, and that we then register MBeans with it for each running job. When the job is killed, its classloader can't be cleaned up, because the JMX registry still holds a reference to the MBean object.

Is there anything that can be done about this in the job code?
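To make the reference chain above concrete: JmxReporter registers each KafkaMbean with the platform MBeanServer, which is owned by the app classloader and keeps a strong reference to every registered bean until it is explicitly unregistered. Since the KafkaMbean class was loaded by the job's ChildFirstClassLoader, that one registration is enough to pin the whole classloader. A small self-contained Java sketch of the mechanism (not taken from the job; the class and bean names are made up for the illustration):

    import java.lang.management.ManagementFactory;
    import java.lang.ref.WeakReference;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;
    import javax.management.StandardMBean;

    public class MBeanRetentionDemo {
        public interface ValueMBean { int getValue(); }
        public static class Value implements ValueMBean {
            public int getValue() { return 42; }
        }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("demo:type=Value");

            Value impl = new Value();
            WeakReference<Object> ref = new WeakReference<>(impl);
            server.registerMBean(new StandardMBean(impl, ValueMBean.class), name);
            impl = null;

            System.gc();
            // Still reachable: the platform MBeanServer holds the only remaining strong reference.
            System.out.println("reachable after register:   " + (ref.get() != null));

            server.unregisterMBean(name);
            System.gc();
            // Usually cleared now (System.gc() is only a hint, so this check is best-effort).
            System.out.println("reachable after unregister: " + (ref.get() != null));
        }
    }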
Thanks for reporting this.

To narrow things down a bit, is your job using both kafka sources and sinks?

On 06.02.2018 21:30, Edward wrote:
> [...]
It would also be useful to know which source/sink you are using and
which kafka version that is.

On 07.02.2018 08:58, Chesnay Schepler wrote:
> Thanks for reporting this.
>
> To narrow things down a bit, is your job using both kafka sources and sinks?
>
> On 06.02.2018 21:30, Edward wrote:
>> [...]
We are using FlinkKafkaConsumer011 and FlinkKafkaProducer011, but we also
experienced the same behavior with FlinkKafkaConsumer010 and FlinkKafkaProducer010.
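For reference, the wiring in question is the standard source-to-sink setup; a minimal sketch on Flink 1.4 would look roughly like the following (topic names, broker addresses and schemas are placeholders, not the actual job):

    import java.util.Properties;

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaPassThroughJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
            props.setProperty("group.id", "leak-test");           // placeholder

            // Each (re)submission creates new Kafka clients inside the job's
            // child-first classloader, and each client registers kafka.* MBeans
            // through its default JmxReporter.
            DataStream<String> stream = env.addSource(
                    new FlinkKafkaConsumer011<>("input-topic", new SimpleStringSchema(), props));

            stream.addSink(
                    new FlinkKafkaProducer011<>("kafka:9092", "output-topic", new SimpleStringSchema()));

            env.execute("kafka pass-through");
        }
    }

Every cancel-and-resubmit of a job like this gets a fresh ChildFirstClassLoader, which is what makes any leftover MBean registrations show up as dead classloaders in the jmap output above.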
This could be a bug in Kafka's JmxReporter class:
https://issues.apache.org/jira/browse/KAFKA-6307

On 07.02.2018 13:37, Edward wrote:
> We are using FlinkKafkaConsumer011 and FlinkKafkaProducer011, but we also
> experienced the same behavior with FlinkKafkaConsumer010 and
> FlinkKafkaProducer010.
I applied the change in the pull request associated with that Kafka bug, and
unfortunately it didn't resolve anything. It doesn't unregister any additional MBeans created by Kafka's JmxReporter -- it is just a fix that removes mbeans from Kafka's own cache of mbeans (i.e. it does its cleanup within the job's ChildFirst classloader, not in the bootstrap/App classloader where the strong reference exists).
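That matches the heap dump: the patch only clears JmxReporter's internal map, while the actual registrations live in the platform MBeanServer, which belongs to the app classloader. One way to see those leftover registrations from job (or any standalone) code is to query the platform server directly. The sketch below assumes the default kafka.consumer / kafka.producer JMX domains, and the commented-out unregister call would be a blunt workaround at best, since it would also remove the MBeans of every other Kafka client running in the same task manager:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class KafkaMBeanInspector {
        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();

            // Matches any domain starting with "kafka." (e.g. kafka.consumer, kafka.producer).
            ObjectName pattern = new ObjectName("kafka.*:*");

            for (ObjectName name : server.queryNames(pattern, null)) {
                System.out.println("still registered: " + name);
                // server.unregisterMBean(name); // blunt workaround: affects ALL kafka clients in this JVM
            }
        }
    }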
I looked through the kafka JmxReporter class (and other kafka classes that we use) and couldn't spot anything that would prevent the MBeans from being unregistered.

The obvious implication would be that we don't properly clean up the kafka classes, but that code looks good as well. So at the moment I'm a bit stumped as to why this is happening.

For reproducing the problem, could you answer the following questions?

On 09.02.2018 14:19, Edward wrote:
> I applied the change in the pull request associated with that Kafka bug, and
> unfortunately it didn't resolve anything. It doesn't unregister any additional
> MBeans created by Kafka's JmxReporter -- it is just a fix that removes mbeans
> from Kafka's own cache of mbeans (i.e. it does its cleanup within the job's
> ChildFirst classloader, not in the bootstrap/App classloader where the strong
> reference exists).
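For reference, the cleanup a JMX-based reporter has to perform when the Kafka client is closed looks schematically like the sketch below. This is an illustration of the expected behaviour, not Kafka's or Flink's actual code; if this step is skipped or never reached when a job is cancelled, the beans stay registered and you get exactly the retention chain from the MAT output earlier in the thread:

    import java.lang.management.ManagementFactory;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.management.JMException;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Shape of the close() path a JMX metrics reporter needs: every bean it
    // registered must be unregistered again, otherwise the platform MBeanServer
    // keeps pinning the classes (and classloader) of those beans.
    public class JmxCleanupSketch {
        private final Map<String, ObjectName> registered = new ConcurrentHashMap<>();

        public void close() {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            for (ObjectName name : registered.values()) {
                try {
                    if (server.isRegistered(name)) {
                        server.unregisterMBean(name); // drops the server's strong reference
                    }
                } catch (JMException e) {
                    // best effort: continue so one failure doesn't leave the rest registered
                }
            }
            registered.clear();
        }
    }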