Suspected classloader leak in Flink 1.11.1


Suspected classloader leak in Flink 1.11.1

Tamir Sagi

Hey all,

We are encountering memory issues on a Flink client and the task managers, which I would like to raise here.

We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).
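For reference, a minimal sketch of how such a client-side submission can look with the Flink 1.11 API (not our exact code; the class name, host, and port below are placeholders):

import org.apache.flink.api.common.JobID;
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.PackagedProgramUtils;
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;
import org.apache.flink.runtime.jobgraph.JobGraph;

import java.io.File;

public class JobSubmitter {

    // Builds a JobGraph from the packaged batch-job jar and submits it to the
    // session cluster's REST endpoint (host/port are placeholders).
    public static JobID submit(File jobJar, String entryClass) throws Exception {
        Configuration config = new Configuration();
        config.setString(RestOptions.ADDRESS, "flink-jobmanager");
        config.setInteger(RestOptions.PORT, 8081);

        PackagedProgram program = PackagedProgram.newBuilder()
                .setJarFile(jobJar)
                .setEntryPointClassName(entryClass)
                .build();

        // The job graph is compiled on the client side, inside a user-code classloader.
        JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, config, 1, false);

        RestClusterClient<String> client = new RestClusterClient<>(config, "session-cluster");
        try {
            return client.submitJob(jobGraph).get();
        } finally {
            client.close();
        }
    }
}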

As jobs are submitted and run, one after another, we see that the metaspace memory (with a max size of 1GB) keeps increasing, along with a linear, though more moderate, increase in heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

We would expect to see a single class loader instance, so we suspect that the reason for the increase is class loaders not being cleaned up.

Does anyone have insights into this issue, or ideas on how to proceed with the investigation?


Flink Client application (VisualVM)

[VisualVM screenshot: dozens of org.apache.flink.util.ChildFirstClassLoader instances (numbered up to ~92), each defining its own copy of com.fasterxml.jackson.databind.PropertyMetadata; each entry shows a size of about 120 bytes and ~0% of the heap]

We have tried different GCs, but the results were the same.


Task Manager

Total size: 4GB

Metaspace: 1GB

Off-heap: 512MB


Screenshot from the task manager: 612MB are occupied and never released.


We used the jcmd tool and attached 3 files:

  1. Thread print
  2. VM.metaspace output
  3. VM.classloader output
In addition, we have tried calling GC manually, but it did not change much.

Thank you




Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize. If this communication has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender by e-mail immediately. 
Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
Viruses: Although we have taken steps toward ensuring that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free.


task-manager-thread-print.txt (49K) Download Attachment
task-manager-vm-classloader.txt (970 bytes) Download Attachment
task-manager-vm-metaspace.txt (2K) Download Attachment

Re: Suspected classloader leak in Flink 1.11.1

Kezhu Wang
Hi Tamir,


Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi ([hidden email]) wrote:


Hey all,

We are encountering memory issues on a Flink client and the task managers, which I would like to raise here.

We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

As jobs are submitted and run, one after another, we see that the metaspace memory (with a max size of 1GB) keeps increasing, along with a linear, though more moderate, increase in heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

We would expect to see a single class loader instance, so we suspect that the reason for the increase is class loaders not being cleaned up.

Does anyone have insights into this issue, or ideas on how to proceed with the investigation?


Flink Client application (VisualVM)

[VisualVM screenshot: dozens of org.apache.flink.util.ChildFirstClassLoader instances (numbered up to ~92), each defining its own copy of com.fasterxml.jackson.databind.PropertyMetadata; each entry shows a size of about 120 bytes and ~0% of the heap]

We have tried different GCs, but the results were the same.


Task Manager

Total size: 4GB

Metaspace: 1GB

Off-heap: 512MB


Screenshot from the task manager: 612MB are occupied and never released.


We used the jcmd tool and attached 3 files:

  1. Thread print
  2. VM.metaspace output
  3. VM.classloader output
In addition, we have tried calling GC manually, but it did not change much.

Thank you






Re: Suspected classloader leak in Flink 1.11.1

Chesnay Schepler
I'd suggest taking a heap dump and investigating what is referencing these classloaders; chances are that some thread isn't being cleaned up.
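For reference, one way to capture such a dump from inside the JVM (a sketch, assuming a HotSpot JVM; jmap -dump or jcmd GC.heap_dump work just as well from the outside):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public final class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);

        // live = true dumps only reachable objects; leaked classloaders will still
        // show up because something is keeping them reachable.
        hotspot.dumpHeap("/tmp/flink-heap.hprof", true);
    }
}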

On 2/28/2021 3:46 PM, Kezhu Wang wrote:
Hi Tamir,


Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi ([hidden email]) wrote:


Hey all,

We are encountering memory issues on a Flink client and the task managers, which I would like to raise here.

We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

As jobs are submitted and run, one after another, we see that the metaspace memory (with a max size of 1GB) keeps increasing, along with a linear, though more moderate, increase in heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

We would expect to see a single class loader instance, so we suspect that the reason for the increase is class loaders not being cleaned up.

Does anyone have insights into this issue, or ideas on how to proceed with the investigation?


Flink Client application (VisualVM)

[VisualVM screenshot: dozens of org.apache.flink.util.ChildFirstClassLoader instances (numbered up to ~92), each defining its own copy of com.fasterxml.jackson.databind.PropertyMetadata; each entry shows a size of about 120 bytes and ~0% of the heap]

We have tried different GCs, but the results were the same.


Task Manager

Total size: 4GB

Metaspace: 1GB

Off-heap: 512MB


Screenshot from the task manager: 612MB are occupied and never released.


We used the jcmd tool and attached 3 files:

  1. Thread print
  2. VM.metaspace output
  3. VM.classloader output
In addition, we have tried calling GC manually, but it did not change much.

Thank you







Re: Suspected classloader leak in Flink 1.11.1

Tamir Sagi
In reply to this post by Kezhu Wang
Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch some pseudo-code for this, or did you create a `ChildFirstClassLoader` yourself?
Yes, we have a batch app: we read a file from S3 using the hadoop-s3 plugin, map that data into a DataSet, and then just print it.
Then we have a Flink client application which saves the batch app jar.
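For illustration, the shape of the batch program is roughly the following (a simplified sketch, not the attached source; the bucket/key and the mapper below are placeholders):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read the input from S3 through the flink-s3-fs-hadoop plugin
        // (bucket and key are placeholders).
        DataSet<String> lines = env.readTextFile("s3://some-bucket/some-key");

        // Transform each record (the real job uses a custom RichFlatMapFunction).
        DataSet<String> mapped = lines.map(line -> line.trim());

        // print() collects the result to the client and triggers execution.
        mapped.print();
    }
}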

Attached the following files:
  1. batch-source-code.java - main function
  2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3. flink-job-submit.txt - The code to submit the job

I've noticed 2 behaviors:
  1. Idle - Once the task manager application boots up, memory consumption gradually grows from ~360MB to ~430MB within a few minutes. I see logs where many classes are loaded into the JVM and never get released. (This might be normal behavior.)
  2. Batch job execution - A simple batch job with a single operation: memory jumps to ~600MB after a single execution, and once the job is finished the memory is never freed. I triggered GC several times (manually and programmatically, see the sketch below), which did not help, although some classes were unloaded. The memory keeps growing as more batch jobs are executed.
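A trivial sketch of what the programmatic GC trigger can look like (System.gc() is only a hint to the JVM, so several spaced-out requests are used):

public final class ForceGc {
    public static void main(String[] args) throws InterruptedException {
        // Request several GC cycles with a pause in between, to give the
        // collector a chance to actually unload unreachable classes.
        for (int i = 0; i < 10; i++) {
            System.gc();
            Thread.sleep(1_000L);
        }
    }
}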
Attached are task manager logs from yesterday after a single batch execution (memory grew to 612MB and was never freed):
  1. taskmgr.txt - task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2. gc-class-histogram.txt
  3. thread-print.txt
  4. vm-class-loader-stats.txt
  5. vm-class-loaders.txt
  6. heap_info.txt

The same behavior has been observed in the Flink client application: once the batch job is executed, memory increases gradually and does not get cleaned up afterwards (we observed many ChildFirstClassLoader instances).


Thank you
Tamir. 


From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch some pseudo-code for this, or did you create a `ChildFirstClassLoader` yourself?


In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,
Thanks for fast responding,

I read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop-s3 plugin), but the same behavior was observed.

attached GC.class_histogram (Not filtered)


Tamir.




From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 4:46 PM
To: [hidden email] <[hidden email]>; Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,


Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi ([hidden email]) wrote:


Hey all,

We are encountering memory issues on a Flink client and the task managers, which I would like to raise here.

We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

As jobs are submitted and run, one after another, we see that the metaspace memory (with a max size of 1GB) keeps increasing, along with a linear, though more moderate, increase in heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

We would expect to see a single class loader instance, so we suspect that the reason for the increase is class loaders not being cleaned up.

Does anyone have insights into this issue, or ideas on how to proceed with the investigation?


Flink Client application (VisualVM)

[VisualVM screenshot: dozens of org.apache.flink.util.ChildFirstClassLoader instances (numbered up to ~92), each defining its own copy of com.fasterxml.jackson.databind.PropertyMetadata; each entry shows a size of about 120 bytes and ~0% of the heap]

We have tried different GCs, but the results were the same.


Task Manager

Total size: 4GB

Metaspace: 1GB

Off-heap: 512MB


Screenshot from the task manager: 612MB are occupied and never released.


We used the jcmd tool and attached 3 files:

  1. Thread print
  2. VM.metaspace output
  3. VM.classloader output
In addition, we have tried calling GC manually, but it did not change much.

Thank you






taskmgr.txt (2M) Download Attachment
thread-print.txt (47K) Download Attachment
vm-class-loader-stats.txt (28K) Download Attachment
vm-class-loaders.txt (1K) Download Attachment
gc-class-histogram.txt (532K) Download Attachment
flink-job-submit.txt (732 bytes) Download Attachment
batch-source-code.java (3K) Download Attachment
FlatMapXSightMsgProcessor.java (1K) Download Attachment
heap-info.txt (420 bytes) Download Attachment

Re: Suspected classloader leak in Flink 1.11.1

Kezhu Wang
Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

From that histogram, I guess there is no classloader leaking.

> A simple batch job with a single operation: memory jumps to ~600MB after a single execution, and once the job is finished the memory is never freed.

It could be just new code paths and hence new classes. A single execution does not tell us much. Multiple or dozens of runs, with memory continuously increasing among them and not decreasing afterwards, would be a symptom of leaking.

You could use the following steps to verify whether there are issues in your task managers:
* Run the job N times, the more the better.
* Wait for all jobs to finish or stop.
* Trigger GC manually a dozen times.
* Take a class histogram and check whether there are any “ChildFirstClassLoader” instances.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram, then we can be pretty sure there is a class loader leak.
* If there are no (or few) “ChildFirstClassLoader” instances but memory is still higher than a threshold, say ~600MB or more, it could be another kind of leak.


In all leaking cases, a heap dump, as @Chesnay said, would be more helpful since it can tell us which object/class/thread keeps memory from being freed.


Besides this, I saw an attachment “task-manager-thread-print.txt” in the initial mail; when and where did you capture it? The task manager? Was there any job still running?


Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch some pseudo-code for this, or did you create a `ChildFirstClassLoader` yourself?
Yes, we have a batch app: we read a file from S3 using the hadoop-s3 plugin, map that data into a DataSet, and then just print it.
Then we have a Flink client application which saves the batch app jar.

Attached the following files:
  1. batch-source-code.java - main function
  2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3. flink-job-submit.txt - The code to submit the job

I've noticed 2 behaviors:
  1. Idle - Once the task manager application boots up, memory consumption gradually grows from ~360MB to ~430MB within a few minutes. I see logs where many classes are loaded into the JVM and never get released. (This might be normal behavior.)
  2. Batch job execution - A simple batch job with a single operation: memory jumps to ~600MB after a single execution, and once the job is finished the memory is never freed. I triggered GC several times (manually and programmatically), which did not help, although some classes were unloaded. The memory keeps growing as more batch jobs are executed.
Attached are task manager logs from yesterday after a single batch execution (memory grew to 612MB and was never freed):
  1. taskmgr.txt - task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2. gc-class-histogram.txt
  3. thread-print.txt
  4. vm-class-loader-stats.txt
  5. vm-class-loaders.txt
  6. heap_info.txt

The same behavior has been observed in the Flink client application: once the batch job is executed, memory increases gradually and does not get cleaned up afterwards (we observed many ChildFirstClassLoader instances).


Thank you
Tamir. 


From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch some pseudo-code for this, or did you create a `ChildFirstClassLoader` yourself?


In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,
Thanks for fast responding,

I read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop-s3 plugin), but the same behavior was observed.

attached GC.class_histogram (Not filtered)


Tamir.




From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 4:46 PM
To: [hidden email] <[hidden email]>; Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,


Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi ([hidden email]) wrote:


Hey all,

We are encountering memory issues on a Flink client and the task managers, which I would like to raise here.

We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

As jobs are submitted and run, one after another, we see that the metaspace memory (with a max size of 1GB) keeps increasing, along with a linear, though more moderate, increase in heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

We would expect to see a single class loader instance, so we suspect that the reason for the increase is class loaders not being cleaned up.

Does anyone have insights into this issue, or ideas on how to proceed with the investigation?


Flink Client application (VisualVM)

[VisualVM screenshot: dozens of org.apache.flink.util.ChildFirstClassLoader instances (numbered up to ~92), each defining its own copy of com.fasterxml.jackson.databind.PropertyMetadata; each entry shows a size of about 120 bytes and ~0% of the heap]

We have tried different GCs, but the results were the same.


Task Manager

Total size: 4GB

Metaspace: 1GB

Off-heap: 512MB


Screenshot from the task manager: 612MB are occupied and never released.


We used the jcmd tool and attached 3 files:

  1. Thread print
  2. VM.metaspace output
  3. VM.classloader output
In addition, we have tried calling GC manually, but it did not change much.

Thank you






Re: Suspected classloader leak in Flink 1.11.1

Tamir Sagi
Hey,

I'd expect that what happens in a single execution will repeat itself in N executions.

I ran an entire cycle of jobs (28 jobs).
Once it finished:
  • Memory had grown to 1GB
  • I called GC ~100 times using the "jcmd 1 GC.run" command, which did not have much effect.
Before running the tests I started flight recording using "jcmd 1 JFR.start", and stopped it after calling GC ~100 times.
The following figure shows the graphs from "recording.jfr" in VisualVM.



and metaspace (top right)


docker stats output, filtered to the relevant task manager container


The following task manager files are attached:
  • task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
  • task-manager-gc-class-histogram.txt - taken via "jcmd 1 GC.class_histogram"

The task manager heap dump is ~100MB;
here is a summary:



Flink client app metrics (taken from Lens):




We see a tight coupling between the task manager app and the Flink client app, as the batch job runs on the client side (via reflection).
What happens with class loaders in that case?

We also noticed many logs in the task manager related to PoolingHttpClientConnectionManager


and IdleConnectionReaper  InterruptedException  



In the client app we noticed many instances of that thread (from the heap dump):



We uploaded 2 heap dumps and the task manager flight recording file to Google Drive:
  1. task-manager-heap-dump.hprof
  2. java_flink_client.hprof.
  3. task-manager-recording.jfr


Thanks,
Tamir.

From: Kezhu Wang <[hidden email]>
Sent: Monday, March 1, 2021 2:21 PM
To: [hidden email] <[hidden email]>; Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

From that histogram, I guess there is no classloader leaking.

> A simple batch job with a single operation: memory jumps to ~600MB after a single execution, and once the job is finished the memory is never freed.

It could be just new code paths and hence new classes. A single execution does not tell us much. Multiple or dozens of runs, with memory continuously increasing among them and not decreasing afterwards, would be a symptom of leaking.

You could use the following steps to verify whether there are issues in your task managers:
* Run the job N times, the more the better.
* Wait for all jobs to finish or stop.
* Trigger GC manually a dozen times.
* Take a class histogram and check whether there are any “ChildFirstClassLoader” instances.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram, then we can be pretty sure there is a class loader leak.
* If there are no (or few) “ChildFirstClassLoader” instances but memory is still higher than a threshold, say ~600MB or more, it could be another kind of leak.


In all leaking cases, a heap dump, as @Chesnay said, would be more helpful since it can tell us which object/class/thread keeps memory from being freed.


Besides this, I saw an attachment “task-manager-thread-print.txt” in the initial mail; when and where did you capture it? The task manager? Was there any job still running?


Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch some pseudo-code for this, or did you create a `ChildFirstClassLoader` yourself?
Yes, we have a batch app: we read a file from S3 using the hadoop-s3 plugin, map that data into a DataSet, and then just print it.
Then we have a Flink client application which saves the batch app jar.

Attached the following files:
  1. batch-source-code.java - main function
  2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3. flink-job-submit.txt - The code to submit the job

I've noticed 2 behaviors:
  1. Idle - Once the task manager application boots up, memory consumption gradually grows from ~360MB to ~430MB within a few minutes. I see logs where many classes are loaded into the JVM and never get released. (This might be normal behavior.)
  2. Batch job execution - A simple batch job with a single operation: memory jumps to ~600MB after a single execution, and once the job is finished the memory is never freed. I triggered GC several times (manually and programmatically), which did not help, although some classes were unloaded. The memory keeps growing as more batch jobs are executed.
Attached are task manager logs from yesterday after a single batch execution (memory grew to 612MB and was never freed):
  1. taskmgr.txt - task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2. gc-class-histogram.txt
  3. thread-print.txt
  4. vm-class-loader-stats.txt
  5. vm-class-loaders.txt
  6. heap_info.txt

The same behavior has been observed in the Flink client application: once the batch job is executed, memory increases gradually and does not get cleaned up afterwards (we observed many ChildFirstClassLoader instances).


Thank you
Tamir. 


From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch some pseudo-code for this, or did you create a `ChildFirstClassLoader` yourself?


In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,
Thanks for fast responding,

I read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop-s3 plugin), but the same behavior was observed.

attached GC.class_histogram (Not filtered)


Tamir.




From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 4:46 PM
To: [hidden email] <[hidden email]>; Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,


Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi ([hidden email]) wrote:


Hey all,

We are encountering memory issues on a Flink client and the task managers, which I would like to raise here.

We are running Flink on a session cluster (version 1.11.1) on Kubernetes and submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

As jobs are submitted and run, one after another, we see that the metaspace memory (with a max size of 1GB) keeps increasing, along with a linear, though more moderate, increase in heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). We see similar behavior in the Flink task manager memory.

We would expect to see a single class loader instance, so we suspect that the reason for the increase is class loaders not being cleaned up.

Does anyone have insights into this issue, or ideas on how to proceed with the investigation?


Flink Client application (VisualVM)

[VisualVM screenshot: dozens of org.apache.flink.util.ChildFirstClassLoader instances (numbered up to ~92), each defining its own copy of com.fasterxml.jackson.databind.PropertyMetadata; each entry shows a size of about 120 bytes and ~0% of the heap]

We have tried different GCs, but the results were the same.


Task Manager

Total size: 4GB

Metaspace: 1GB

Off-heap: 512MB


Screenshot from the task manager: 612MB are occupied and never released.


We used the jcmd tool and attached 3 files:

  1. Thread print
  2. VM.metaspace output
  3. VM.classloader output
In addition, we have tried calling GC manually, but it did not change much.

Thank you






task-manager-gc-class-histogram.txt (1M) Download Attachment
task-manager-VM-metaspace.txt (2K) Download Attachment

Re: Suspected classloader leak in Flink 1.11.1

Chesnay Schepler

The java-sdk-connection-reaper thread and Amazon's JMX integration are causing the leak.
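For reference, a commonly suggested mitigation (a sketch, assuming the AWS SDK v1 reaper thread is what pins the user-code classloader) is to stop it explicitly when the user code is torn down:

import com.amazonaws.http.IdleConnectionReaper;

public final class AwsSdkCleanup {

    // Stops the background "java-sdk-http-connection-reaper" thread so it no
    // longer holds on to the job's user-code classloader. Safe to call even if
    // the reaper was never started; returns true if a running thread was stopped.
    public static boolean shutdownConnectionReaper() {
        return IdleConnectionReaper.shutdown();
    }
}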


What strikes me as odd is that I see some dynamodb classes being referenced in the child classloaders, but I don't see where they could come from based on the application that you provided us with.


Could you clarify how exactly you depend on Amazon dependencies? (connectors, filesystems, _other stuff_)


On 3/1/2021 5:24 PM, Tamir Sagi wrote:
Hey,

I'd expect that what happens in a single execution will repeat itself in N executions.

I ran an entire cycle of jobs (28 jobs).
Once it finished:
  • Memory had grown to 1GB
  • I called GC ~100 times using the "jcmd 1 GC.run" command, which did not have much effect.
Before running the tests I started flight recording using "jcmd 1 JFR.start", and stopped it after calling GC ~100 times.
The following figure shows the graphs from "recording.jfr" in VisualVM.



and metaspace (top right)


docker stats output, filtered to the relevant task manager container


The following task manager files are attached:
  • task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
  • task-manager-gc-class-histogram.txt - taken via "jcmd 1 GC.class_histogram"

The task manager heap dump is ~100MB;
here is a summary:



Flink client app metrics (taken from Lens):




We see a tight coupling between the task manager app and the Flink client app, as the batch job runs on the client side (via reflection).
What happens with class loaders in that case?

We also noticed many logs in the task manager related to PoolingHttpClientConnectionManager


and IdleConnectionReaper  InterruptedException  



In the client app we noticed many instances of that thread (from the heap dump):



We uploaded 2 heap dumps and the task manager flight recording file to Google Drive:
  1. task-manager-heap-dump.hprof
  2. java_flink_client.hprof.
  3. task-manager-recording.jfr


Thanks,
Tamir.

From: Kezhu Wang [hidden email]
Sent: Monday, March 1, 2021 2:21 PM
To: [hidden email] [hidden email]; Tamir Sagi [hidden email]
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

From that histogram, I guess there is no classloader leaking.

> A simple batch job with a single operation: memory jumps to ~600MB after a single execution, and once the job is finished the memory is never freed.

It could be just new code paths and hence new classes. A single execution does not tell us much. Multiple or dozens of runs, with memory continuously increasing among them and not decreasing afterwards, would be a symptom of leaking.

You could use the following steps to verify whether there are issues in your task managers:
* Run the job N times, the more the better.
* Wait for all jobs to finish or stop.
* Trigger GC manually a dozen times.
* Take a class histogram and check whether there are any “ChildFirstClassLoader” instances.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram, then we can be pretty sure there is a class loader leak.
* If there are no (or few) “ChildFirstClassLoader” instances but memory is still higher than a threshold, say ~600MB or more, it could be another kind of leak.


In all leaking cases, a heap dump, as @Chesnay said, would be more helpful since it can tell us which object/class/thread keeps memory from being freed.


Besides this, I saw an attachment “task-manager-thread-print.txt” in the initial mail; when and where did you capture it? The task manager? Was there any job still running?


Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

> By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you provide pseudo code for this, or did you create a `ChildFirstClassLoader` yourself?
Yes, we have a batch app. We read a file from S3 using the hadoop-s3 plugin, map that data into a DataSet, and then just print it (sketched below).
Then we have a Flink Client application which stores the batch app jar.
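Roughly, the batch app's main method looks like the following minimal sketch (the bucket path and the mapper here are placeholders, not the actual attached code):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class BatchJobSketch {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Read the input file from S3 (served through the flink-s3-fs-hadoop plugin).
            DataSet<String> lines = env.readTextFile("s3://some-bucket/some-input-file");

            // Map every line with a custom RichFlatMapFunction and print the result;
            // for a DataSet job, print() also triggers the execution.
            lines.flatMap(new PassThroughMapper()).print();
        }

        // Placeholder mapper; the real one is the attached FlatMapXSightMsgProcessor.
        public static class PassThroughMapper extends RichFlatMapFunction<String, String> {
            @Override
            public void flatMap(String value, Collector<String> out) {
                out.collect(value);
            }
        }
    }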

I attached the following files:
  1. batch-source-code.java - main function
  2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3. flink-job-submit.txt - the code to submit the job (a rough sketch of the submission follows below)
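For reference, the submission roughly follows this pattern (a simplified sketch with placeholder host, port and jar path, not the attached flink-job-submit.txt; error handling omitted):

    import java.io.File;

    import org.apache.flink.client.deployment.StandaloneClusterId;
    import org.apache.flink.client.program.PackagedProgram;
    import org.apache.flink.client.program.PackagedProgramUtils;
    import org.apache.flink.client.program.rest.RestClusterClient;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.RestOptions;
    import org.apache.flink.runtime.jobgraph.JobGraph;

    public class SubmitJobSketch {

        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            config.setString(RestOptions.ADDRESS, "flink-jobmanager"); // placeholder host
            config.setInteger(RestOptions.PORT, 8081);

            // The job graph is compiled on the client side from the batch app jar.
            PackagedProgram program = PackagedProgram.newBuilder()
                    .setJarFile(new File("/path/to/batch-app.jar"))   // placeholder path
                    .build();
            JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, config, 1, false);

            // Submit through the session cluster's REST endpoint and wait for acceptance.
            RestClusterClient<StandaloneClusterId> client =
                    new RestClusterClient<>(config, StandaloneClusterId.getInstance());
            try {
                client.submitJob(jobGraph).get();
            } finally {
                client.close();
            }
        }
    }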

I've noticed 2 behaviors:
  1. Idle - Once the Task Manager application boots up, memory consumption gradually grows from ~360MB to ~430MB (within a few minutes). I see logs where many classes are loaded into the JVM and never get released. (Might be normal behavior.)
  2. Batch Job Execution - A simple batch job with a single operation. The memory bumps to ~600MB (after a single execution); once the job is finished the memory is never freed. I executed GC several times (manually + programmatically); it did not help (although some classes were unloaded). The memory keeps growing while more batch jobs are executed.
Attached are Task Manager logs from yesterday after a single batch execution (memory grew to 612MB and was never freed):
  1. taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2. gc-class-historgram.txt
  3. thread-print.txt
  4. vm-class-loader-stats.txt
  5. vm-class-loaders.txt
  6. heap_info.txt

The same behavior has been observed in the Flink Client application: once the batch job is executed, memory increases gradually and does not get cleaned up afterwards. (We observed many ChildFirstClassLoader instances.)


Thank you
Tamir. 


From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 




Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you provide pseudo code for this, or did you create a `ChildFirstClassLoader` yourself?


In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,
Thanks for the fast response,

I read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop s3 plugin), but the same behavior was observed.

Attached is the GC.class_histogram output (not filtered).


Tamir.




From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 4:46 PM
To: [hidden email] <[hidden email]>; Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 




Hi Tamir,


Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang




Reply | Threaded
Open this post in threaded view
|

Re: Suspected classloader leak in Flink 1.11.1

Kezhu Wang
Hi Chesnay,

Thanks for giving a hand and solving this.

I guess `FlatMapXSightMsgProcessor` is a minimal reproducible version, while the heap dump could be taken from a near-production environment.


Best,
Kezhu Wang

On March 2, 2021 at 01:00:52, Chesnay Schepler ([hidden email]) wrote:

The java-sdk-connection-reaper thread and Amazon's JMX integration are causing the leak.


What strikes me as odd is that I see some dynamodb classes being referenced in the child classloaders, but I don't see where they could come from based on the application that you provided us with.


Could you clarify how exactly you depend on Amazon dependencies? (connectors, filesystems, _other stuff_)




Reply | Threaded
Open this post in threaded view
|

Re: Suspected classloader leak in Flink 1.11.1

Tamir Sagi
Thank you Kezhu and Chesnay,

The code I provided is minimal code to show what is executed as part of the batch job along with the Flink client app. You were right that it's not consistent with the heap dump (which was taken in a dev environment).

We run multiple integration tests (a job per test) against a Flink session cluster (running on Kubernetes) with 2 task managers and a single job manager. The jobs are submitted via the Flink Client app, which runs on top of a Spring Boot application alongside Kafka.

I suspected that IdleConnectionReaper is the root cause of some sort of leak (in the Flink client app), so I tried to manually shut down the IdleConnectionReaper once the job finished,
by calling "com.amazonaws.http.IdleConnectionReaper.shutdown()", which is suggested as a workaround.

It did not help much; the memory was not released (the shutdown method always returns false, the instance is null). Ref: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/IdleConnectionReaper.java#L148-L157
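For completeness, the workaround attempt looks roughly like this (a sketch; the hook it is called from is our own and hypothetical):

    import com.amazonaws.http.IdleConnectionReaper;

    public final class AwsReaperCleanup {

        // Called from our own job-finished hook in the client app.
        public static void shutdownIdleConnectionReaper() {
            // shutdown() only returns true if a reaper instance exists and its thread
            // was stopped. In our case it always returns false, i.e. no instance is
            // registered in this classloader, so nothing is actually released.
            boolean stopped = IdleConnectionReaper.shutdown();
            System.out.println("IdleConnectionReaper.shutdown() returned " + stopped);
        }
    }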

In the batch code, I added a close method which closes the connections to the AWS clients once the operation finishes. It did not help either, as the memory keeps growing gradually.

We came across the following setting 

Any more ideas based on the heap dump / flight recording (Task Manager)?
Is it correct that the Flink client & Task Manager are strongly coupled?

Thanks,
Tamir.



Reply | Threaded
Open this post in threaded view
|

Re: Suspected classloader leak in Flink 1.11.1

Chesnay Schepler
The client and TaskManager are not coupled in any way. The client serializes individual functions that are transmitted to the task managers, deserialized, and run.
Hence, if your functions rely on any library that needs cleanup, then you must add that cleanup to the respective function, likely by extending the RichFunction variants, to ensure it is executed on the task manager.
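For example, a minimal sketch of such cleanup for a function that creates an AWS SDK client (the S3 client here is an assumption for illustration, not the actual FlatMapXSightMsgProcessor):

    import com.amazonaws.http.IdleConnectionReaper;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class CleaningUpFlatMap extends RichFlatMapFunction<String, String> {

        private transient AmazonS3 s3;

        @Override
        public void open(Configuration parameters) {
            // Created on the task manager, inside the job's ChildFirstClassLoader.
            s3 = AmazonS3ClientBuilder.defaultClient();
        }

        @Override
        public void flatMap(String value, Collector<String> out) {
            // ... use the client and emit results ...
            out.collect(value);
        }

        @Override
        public void close() {
            // Runs on the task manager when the task finishes: release everything the
            // library started, so the user-code classloader can later be collected.
            if (s3 != null) {
                s3.shutdown();
            }
            // Stop the connection-reaper thread started by the AWS SDK.
            IdleConnectionReaper.shutdown();
            // Amazon's JMX/metrics integration may need its own unregistration as well;
            // that part is not shown here.
        }
    }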

On 3/2/2021 4:52 PM, Tamir Sagi wrote:
Thank you Kezhu and Chesnay,

The code I provided you is a minimal code to show what is executed as part of the batch along the Flink client app. you were right that it's not consistent with the heap dump. (which has been taken in dev env)

We run multiple Integration tests(Job per test) against Flink session cluster(Running on Kubernetes). with 2 task manager, single job manager. The jobs are submitted via Flink Client app which runs on top of spring boot application along Kafka.

I suspected that IdleConnectionReaper is the root cause to some sort of leak(In the flink client app) however, I was trying to manually shutdown the IdleConnectionReaper Once the job finished.
via calling "com.amazonaws.http.IdleConnectionReaper.shutdown()".  - which is suggested as a workaround.

It did not affect much. the memory has not been released .(shutdown method always returns false, the instance is null) ref: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/IdleConnectionReaper.java#L148-L157

On the batch code , I added close method which close the connections to aws clients once the operation finished. it did not help either, as the memory keep growing gradually.

We came across the following setting 

any more ideas based on the heap dump/Flight recording(Task manager)?
Is it correct that the Flink client & Task manager are strongly coupled?

Thanks,
Tamir.


From: Kezhu Wang [hidden email]
Sent: Monday, March 1, 2021 7:54 PM
To: Tamir Sagi [hidden email]; [hidden email] [hidden email]; Chesnay Schepler [hidden email]
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Chesnay,

Thanks for give a hand and solve this.

I guess `FlatMapXSightMsgProcessor` is a minimal reproducible version while the heap dump could be taken from near production environment.


Best,
Kezhu Wang

On March 2, 2021 at 01:00:52, Chesnay Schepler ([hidden email]) wrote:

the java-sdk-connection-reaper thread and amazon's JMX integration are causing the leak.


What strikes me as odd is that I see some dynamodb classes being referenced in the child classloaders, but I don't see where they could come from based on the application that you provided us with.


Could you clarify how exactly you depend on Amazon dependencies? (connectors, filesystems, _other stuff_)


On 3/1/2021 5:24 PM, Tamir Sagi wrote:
Hey,

I'd expect that what happens in a single execution will repeat itself in N executions.

I ran entire cycle of jobs(28 jobs). 
Once it finished:
  • Memory has grown to 1GB
  • I called GC ~100 times using "jcmd 1 GC.run" command.  Which did not affect much.
Prior running the tests I started Flight recording using "jcmd 1 JFR.start ", I stopped it after calling GC ~100 times.
Following figure shows the graphs from "recording.jfr" in Virtual Vm.



and Metaspace(top right)


docker stats command filtered to relevant Task manager container


Following files of task-manager are attached :
  • task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
  • task-manager-gc-class-histogram.txt via "jcmd 1 GC.class_histogram"

Task manager heap dump is ~100MB,
here is a summary:



Flink client app metric(taken from Lens):




We see a tight coupling between Task Manager app and Flink Client app, as the batch job runs on the client side(via reflection)
what happens with class loaders in that case? 

we also noticed many logs in Task manager related to PoolingHttpClientConnectionManager


and IdleConnectionReaper  InterruptedException  



On Client app we noticed many instances of that thread (From heap dump)



We uploaded 2 heap dumps and task-manager flight recording file into Google drive
  1. task-manager-heap-dump.hprof
  2. java_flink_client.hprof.
  3. task-manager-recording.jfr


Thanks,
Tamir.

From: Kezhu Wang [hidden email]
Sent: Monday, March 1, 2021 2:21 PM
To: [hidden email] [hidden email]; Tamir Sagi [hidden email]
Subject: Re: Suspected classloader leak in Flink 1.11.1
 

EXTERNAL EMAIL



Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

>From that histogram, I guest there is no classloader leaking.

> A simple batch job with single operation . The memory bumps to ~600MB (after single execution). once the job is finished the memory never freed.

It could be just new code paths and hence new classes. A single execution does not making much sense. Multiple or dozen runs and continuous memory increasing among them and not decreasing after could be symptom of leaking.

You could use following steps to verify whether there are issues in your task managers:
* Run job N times, the more the better.
* Wait all jobs finished or stopped.
* Trigger manually gc dozen times.
* Take class histogram and check whether there are any “ChildFirstClassLoader”.
* If there are roughly N “ChildFirstClassLoader” in histogram, then we can pretty sure there might be class loader leaking.
* If there is no “ChildFirstClassLoader” or few but memory still higher than a threshold, say ~600MB or more, it could be other shape of leaking.


In all leaking case, an heap dump as @Chesnay said could be more helpful since it can tell us which object/class/thread keep memory from freeing.


Besides this, I saw an attachment “task-manager-thrad-print.txt” in initial mail, when and where did you capture ? Task Manager ? Is there any job still running ? 


Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

By means of batch job, do you means that you compile job graph from DataSet API in client side and then submit it through RestClient ? I am not familiar with data set api, usually, there is no `ChildFirstClassLoader` creation in client side for job graph building. Could you depict a pseudo for this or did you create `ChildFirstClassLoader` yourself ?
Yes, we have a batch app: we read a file from S3 using the hadoop-s3 plugin, map that data into a DataSet, and then just print it.
Then we have a Flink client application which saves the batch app jar.
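
For reference, a minimal sketch of the kind of batch job described above (this is not the attached batch-source-code.java; the bucket, path and class names are made up):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchJobSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read from S3 through the flink-s3-fs-hadoop plugin (s3:// scheme), then map and print.
        DataSet<String> lines = env.readTextFile("s3://some-bucket/input.txt");

        lines.map(new MapFunction<String, String>() {
            @Override
            public String map(String line) {
                return line.trim();
            }
        }).print();   // print() collects to the client and triggers execution of the DataSet job
    }
}

The client application builds the job graph for a program like this before handing it to the RestClusterClient for submission to the session cluster.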

Attached are the following files:
  1. batch-source-code.java - main function
  2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3. flink-job-submit.txt - The code to submit the job

I've noticed 2 behaviors:
  1. Idle - Once the task manager application boots up, memory consumption gradually grows from ~360MB to ~430MB (within a few minutes). I see logs where many classes are loaded into the JVM and never released (this might be normal behavior).
  2. Batch job execution - A simple batch job with a single operation. Memory jumps to ~600MB after a single execution, and once the job is finished it is never freed. I triggered GC several times (manually + programmatically) and it did not help (although some classes were unloaded). Memory keeps growing as more batch jobs are executed.
Attached are task manager logs from yesterday after a single batch execution (memory grew to 612MB and was never freed):
  1. taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2. gc-class-historgram.txt
  3. thread-print.txt
  4. vm-class-loader-stats.txt
  5. vm-class-loaders.txt
  6. heap_info.txt

The same behavior has been observed in the Flink client application: once a batch job is executed, memory gradually increases and is not cleaned up afterwards (we observed many ChildFirstClassLoader instances).


Thank you
Tamir. 


From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 




Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch pseudo-code for this, or did you create a `ChildFirstClassLoader` yourself?


In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi ([hidden email]) wrote:

Hey Kezhu,
Thanks for fast responding,

I read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop s3 plugin), but the same behavior was observed.

Attached is the GC.class_histogram output (not filtered).


Tamir.




From: Kezhu Wang <[hidden email]>
Sent: Sunday, February 28, 2021 4:46 PM
To: [hidden email] <[hidden email]>; Tamir Sagi <[hidden email]>
Subject: Re: Suspected classloader leak in Flink 1.11.1
 




Hi Tamir,


Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi ([hidden email]) wrote:


Hey all,

We are encountering memory issues on a Flink client and task managers, which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

When jobs are being submitted and running, one after another, We see that the metaspace memory(with max size of  1GB)  keeps increasing, as well as linear increase in the heap memory (though it's a more moderate increase). We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

We would expect to see one instance of Class Loader. Therefore, We suspect that the reason for the increase is Class Loaders not being cleaned.

Does anyone have some insights about this issue, or ideas how to proceed the investigation?


Flink Client application (VisualVm)



 

 

[VisualVM heap table: many com.fasterxml.jackson.databind.PropertyMetadata entries and dozens of org.apache.flink.util.ChildFirstClassLoader instances, each with a shallow/retained size of ~120 bytes]

We have used different GCs but same results.


Task Manager


Total Size 4GB

metaspace 1GB

Off heap 512mb


Screenshot from the task manager: 612MB are occupied and not being released.


We used the jcmd tool and attached 3 files:

  1. Threads print
  2. VM.metaspace output
  3. VM.classloader
In addition, we have tried calling GC manually, but it did not change much.

Thank you






Reply | Threaded
Open this post in threaded view
|

Re: Suspected classloader leak in Flink 1.11.1

Kezhu Wang
Hi all,

@Chesnay is right, there is no code execution coupling between client and task manager.

But before a job is submitted to the Flink cluster, the client needs to go through several steps to build the job graph for submission.

These steps can include:
* Constructing user functions.
* Constructing runtime stream operators if necessary.
* Other possibly unrelated steps.

The constructed functions/operators are only *opened* once they run in the Flink cluster, not on the client.

There are no cleanup operations for these functions/operators on the client side. So if you do any
resource-consuming work in the construction of these functions/operators, you will probably leak
those resources on the client side.

In your case, these resource-consuming operations could be:
* Registering the `com.amazonaws.metrics.MetricAdmin` mbean, directly or indirectly.
* Starting the `IdleConnectionReaper`, directly or indirectly (see the sketch below).
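
To make that concrete, here is a hedged sketch (names invented, not the code from this thread) of how such a client-side leak can arise: anything done in a function's constructor or field initializers runs on the client while the job graph is built, and nothing on the client ever cleans it up.

import com.amazonaws.metrics.AwsSdkMetrics;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.util.Collector;

// Anti-pattern: resource-consuming work in the constructor.
public class LeakyFlatMap extends RichFlatMapFunction<String, String> {

    public LeakyFlatMap() {
        // Runs on the CLIENT while PackagedProgramUtils.createJobGraph constructs the function:
        AwsSdkMetrics.enableDefaultMetrics();   // registers the MetricAdmin MBean
        AmazonS3ClientBuilder.defaultClient();  // builds (and discards) a client, which starts the SDK's connection reaper thread
    }

    @Override
    public void flatMap(String value, Collector<String> out) {
        out.collect(value);
    }
}

Deferring this work to open()/close() (or to a classloader release hook, as below) keeps it off the client entirely, because those lifecycle methods only run on the task manager.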


For task manager side resource cleanup, `RuntimeContext.registerUserCodeClassLoaderReleaseHookIfAbsent`
could also be useful for global resource cleanup such as mbean un-registration.
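
A hedged sketch of that hook, assuming a Flink version that exposes RuntimeContext#registerUserCodeClassLoaderReleaseHookIfAbsent (it is not available in 1.11.1) and the AWS SDK v1 on the classpath; the class and hook names are made up:

import com.amazonaws.http.IdleConnectionReaper;
import com.amazonaws.metrics.AwsSdkMetrics;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class AwsAwareFlatMap extends RichFlatMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) {
        // Registered at most once per user-code classloader; Flink runs the hook when it
        // releases that classloader, so the SDK's global state no longer pins it in memory.
        getRuntimeContext().registerUserCodeClassLoaderReleaseHookIfAbsent(
                "aws-sdk-cleanup",
                () -> {
                    IdleConnectionReaper.shutdown();             // stop the connection reaper thread
                    AwsSdkMetrics.unregisterMetricAdminMBean();  // un-register the MBean
                });
    }

    @Override
    public void flatMap(String value, Collector<String> out) {
        out.collect(value);
    }
}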

Besides this, I observed two additional symptoms which might be useful:

* A "kafka-producer-network-thread" (loaded through the AppClassLoader) is still running.
* The `MetricAdmin` mbean and `IdleConnectionReaper` are also loaded by the `PluginClassLoader`.

> shutdown method always returns false, the instance is null

Outside `PackagedProgramUtils.createJobGraph`, the class loader is your application class loader, while the leaking resources are created inside `createJobGraph` through the `ChildFirstClassLoader`.


Best,
Kezhu Wang

On March 3, 2021 at 02:33:58, Chesnay Schepler ([hidden email]) wrote:

The client and TaskManager are not coupled in any way. The client serializes individual functions, which are transmitted to the task managers, deserialized, and run.
Hence, if your functions rely on any library that needs cleanup, then you must add this cleanup to the respective function, likely by extending the RichFunction variants, to ensure it is executed on the task manager.
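
A minimal sketch of that pattern, assuming the function talks to S3 via the AWS SDK v1 (the class, bucket and field names are illustrative, not the thread's FlatMapXSightMsgProcessor):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class S3ReadingFlatMap extends RichFlatMapFunction<String, String> {

    private transient AmazonS3 s3;   // transient: built on the task manager, never serialized

    @Override
    public void open(Configuration parameters) {
        s3 = AmazonS3ClientBuilder.defaultClient();
    }

    @Override
    public void flatMap(String key, Collector<String> out) {
        out.collect(s3.getObjectAsString("some-bucket", key)); // "some-bucket" is a placeholder
    }

    @Override
    public void close() {
        if (s3 != null) {
            s3.shutdown();   // releases the connection pool held by this function's classloader
        }
    }
}

Because the field is transient and only populated in open(), the client never creates the S3 client; the task manager creates it, and close() releases it together with its connection pool.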

On 3/2/2021 4:52 PM, Tamir Sagi wrote:
Thank you Kezhu and Chesnay,

The code I provided you is minimal code showing what is executed as part of the batch job alongside the Flink client app. You were right that it's not consistent with the heap dump (which was taken in a dev environment).

We run multiple integration tests (one job per test) against a Flink session cluster (running on Kubernetes) with 2 task managers and a single job manager. The jobs are submitted via the Flink client app, which runs on top of a Spring Boot application alongside Kafka.

I suspected that IdleConnectionReaper is the root cause of some sort of leak (in the Flink client app), so I tried to manually shut down the IdleConnectionReaper once the job finished,
by calling "com.amazonaws.http.IdleConnectionReaper.shutdown()", which is suggested as a workaround.

It did not help much; the memory was not released (the shutdown method always returns false, the instance is null). Ref: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/IdleConnectionReaper.java#L148-L157
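
One hedged reading of that reference: if the AWS SDK is bundled in the job jar, each job's ChildFirstClassLoader defines its own copy of IdleConnectionReaper with its own static state, so shutdown() called on the copy loaded by the application classloader sees a null instance and returns false, while the reaper thread started inside the job's classloader keeps running. A small check along these lines (jobClassLoader is a placeholder for the job's classloader) would confirm it:

final class ClassLoaderCheck {
    // Returns false when the job's child-first classloader loaded its own copy of the class,
    // i.e. the copy whose shutdown() was called is not the one that started the reaper thread.
    static boolean sameReaperClass(ClassLoader jobClassLoader) throws ClassNotFoundException {
        Class<?> appCopy = Class.forName("com.amazonaws.http.IdleConnectionReaper");
        Class<?> jobCopy = Class.forName("com.amazonaws.http.IdleConnectionReaper", true, jobClassLoader);
        return appCopy == jobCopy;
    }
}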

In the batch code, I added a close method which closes the connections to the AWS clients once the operation finishes. It did not help either; the memory keeps growing gradually.

We came across the following setting 

Any more ideas based on the heap dump / flight recording (task manager)?
Is it correct that the Flink client & task manager are strongly coupled?

Thanks,
Tamir.


From: Kezhu Wang [hidden email]
Sent: Monday, March 1, 2021 7:54 PM
To: Tamir Sagi [hidden email]; [hidden email] [hidden email]; Chesnay Schepler [hidden email]
Subject: Re: Suspected classloader leak in Flink 1.11.1
 




Hi Chesnay,

Thanks for lending a hand and solving this.

I guess `FlatMapXSightMsgProcessor` is a minimal reproducible version, while the heap dump may have been taken from a near-production environment.


Best,
Kezhu Wang

On March 2, 2021 at 01:00:52, Chesnay Schepler ([hidden email]) wrote:

The java-sdk-connection-reaper thread and Amazon's JMX integration are causing the leak.


What strikes me as odd is that I see some dynamodb classes being referenced in the child classloaders, but I don't see where they could come from based on the application that you provided us with.


Could you clarify how exactly you depend on Amazon dependencies? (connectors, filesystems, _other stuff_)

