GC on TaskManagers stats

GC on TaskManagers stats

Guido
Hello,

I have a few questions about the garbage collector stats on TaskManagers; any help or pointers to further documentation would be great.
I polled the stats of 7 TaskManagers once per second, through the corresponding request (/taskmanagers/<idtaskmanager>/) of the Monitoring REST API, while a job that took 38 seconds overall was running.

This gave me 38 records for each TaskManager. Focusing on the garbage collector stats, I can see, for example, in one of the 38 records:

- PS-Scavenge.Time: 2597, PS-MarkSweep.Time: 29016;
1. Is it correct to assume they represent the total elapsed time of the different GCs (young and old generation, respectively)? So, did I basically get a running sum?
2. If yes, are the values in milliseconds, i.e. 29 seconds?

3. Could they be used to work out how much time was lost in total to “stop-the-world” GC pauses?

Finally, on the same record:

PS-Scavenge.Count: 3, PS-MarkSweep.Count: 5, load: 3.73.

4. Is the “load” value closely related to these?

Sorry for the long message, and thanks a lot.

Guido

 

Re: GC on TaskManagers stats

rmetzger0
Hi Guido,

Sorry for the late reply. You were collecting the stats every second. Afaik, Flink internally collects the stats every 5 seconds, so you can change either your polling interval or Flink's (I think it's taskmanager.heartbeat-interval).
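
To illustrate why polling faster than the internal refresh adds little, here is a minimal Java sketch that turns samples of a cumulative counter into per-poll deltas. The class name, the method name and the sample values are made up for illustration; they are not taken from Flink or from the records above.

```java
import java.util.ArrayList;
import java.util.List;

public class GcDeltas {
    /** Returns the differences between consecutive samples of a cumulative counter. */
    static List<Long> deltas(long[] cumulative) {
        List<Long> out = new ArrayList<>();
        for (int i = 1; i < cumulative.length; i++) {
            out.add(cumulative[i] - cumulative[i - 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical samples taken once per second; the reported value
        // only changes when the source refreshes (here: every 5th poll).
        long[] samples = {100, 100, 100, 100, 100, 180, 180, 180, 180, 180, 250};
        System.out.println(deltas(samples));
    }
}
```

Most deltas are zero because the same refreshed value is read several times, so the extra polls carry no new information.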

Regarding the details on PS-Scavenge, PS-MarkSweep, etc.: we just use the names that the Java management beans return, so you can search for those names and read how to interpret them. For example: http://www.ibm.com/developerworks/library/j-jtp11253/
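
As a rough sketch of what those management beans expose, the standard `java.lang.management` API can be queried directly. `getCollectionCount()` returns the total number of collections so far, and `getCollectionTime()` returns the approximate accumulated collection time in milliseconds, so both are running totals. The class name `GcBeans` and the helper `totalGcMillis` are illustrative only, and how Flink maps these values into its REST response is an assumption here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcBeans {
    /** Sums the accumulated collection time (ms) over all collectors of this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Bean names depend on the collector in use; with the parallel
        // collector they are typically "PS Scavenge" and "PS MarkSweep".
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC ms so far: " + totalGcMillis());
    }
}
```

For a collector that does its work in stop-the-world pauses, the accumulated time is a reasonable approximation of total pause time, but that reading is an interpretation rather than something this thread confirms.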

The load is the operating system load.
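
For comparison, the JVM exposes the operating system load average through `OperatingSystemMXBean`. Whether Flink obtains its "load" value this way is an assumption; the sketch only shows that the figure describes the whole machine and is not derived from the GC counters. The class name `SystemLoad` is illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class SystemLoad {
    /** Returns the system load average for the last minute, or a negative value if unavailable. */
    static double loadAverage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        return os.getSystemLoadAverage();
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Load is usually read relative to the number of available processors.
        System.out.println("load=" + loadAverage()
                + " processors=" + os.getAvailableProcessors());
    }
}
```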



On Thu, Feb 4, 2016 at 10:25 PM, Guido wrote:
[quoted message trimmed; it repeats the original post above]