Re: Task Manager was lost/killed due to full GC

Posted by Vinay Patil on
URL: http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Task-Manager-was-lost-killed-due-to-full-GC-tp15386p15393.html

Hi,

Did you try running your pipeline with the RocksDB state backend? Are you managing state in the pipeline yourself, or are you using windowing?

Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory: 17148908

From the above stats, it seems you are running out of memory, which is why the TM was killed.

I have run into a similar issue, where the TM was frequently killed or the job progressed slowly because of full GCs. Moving to RocksDB solved it.


Regards,
Vinay Patil

On Tue, Sep 5, 2017 at 6:39 PM, ShB [via Apache Flink User Mailing List archive.] <[hidden email]> wrote:
Hi,

I'm running a Flink batch job that reads almost 1 TB of data from S3 and
then performs operations on it. A list of filenames is distributed among
the TMs, and each TM reads its subset of files from S3. The job fails at
the read step with the following error:
java.lang.Exception: TaskManager was lost/killed
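
The file distribution described above could be sketched roughly as follows. This is only an illustration in Python, not the poster's actual job code: the function name `partition_round_robin`, the bucket name `example-bucket`, and the round-robin assignment are all assumptions made for the example.

```python
def partition_round_robin(filenames, num_workers):
    """Split filenames into num_workers lists, assigning them in rotation."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Worker i receives items i, i + num_workers, i + 2 * num_workers, ...
    return [filenames[i::num_workers] for i in range(num_workers)]


# Hypothetical S3 paths, used only to exercise the function.
files = [f"s3://example-bucket/part-{i:04d}" for i in range(10)]
subsets = partition_round_robin(files, 3)
```

Each worker then reads only its own subset, so the amount of data held in memory per TM depends on how many files land in one subset and how large they are.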

Having read similar questions on the mailing list, it seems like this is a
memory issue, with full GC at the TM causing the TM to be lost.

After enabling memory debugging, these seem to be the stats just before
erroring out:
Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB
(used/committed/max)]
Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory:
17148908
Off-heap pool stats: [Code Cache: 25/27/240 MB (used/committed/max)],
[Metaspace: 47/48/-1 MB (used/committed/max)], [Compressed Class Space:
5/5/1024 MB (used/committed/max)]
Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC
COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]
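
To make the stats above easier to compare, here is a minimal Python sketch that pulls a few ratios out of them. It assumes the log format is exactly as quoted, and that the direct memory figures are in bytes (an assumption; the log line itself does not state a unit). The function name `summarize` is made up for this example.

```python
import re

# Log lines copied from the memory debugging output quoted above.
LOG = (
    "Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB "
    "(used/committed/max)]\n"
    "Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory: "
    "17148908\n"
    "Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC "
    "COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]\n"
)


def summarize(log):
    """Extract heap utilization, direct memory use, and average GC time."""
    heap = re.search(r"\[HEAP: (\d+)/(\d+)/(\d+) MB", log)
    used, _committed, max_heap = (int(g) for g in heap.groups())

    direct = re.search(r"Total Capacity: (\d+), Used Memory:\s*(\d+)", log)
    _capacity, direct_used = (int(g) for g in direct.groups())

    # One (name, total time in ms, collection count) tuple per collector.
    gcs = re.findall(
        r"\[([^,\[]+), GC TIME \(ms\): (\d+), GC\s+COUNT: (\d+)\]", log
    )

    return {
        "heap_used_pct": round(100 * used / max_heap, 1),
        # Assumes the direct memory value is reported in bytes.
        "direct_used_mib": round(direct_used / 2**20, 1),
        "avg_gc_ms": {name: round(int(t) / int(c), 1) for name, t, c in gcs},
    }


summary = summarize(LOG)
```

On the quoted numbers, `summary` reports the heap at about 44.5% of its maximum, an average young-generation collection of about 57.6 ms, and an average old-generation collection of 344.5 ms.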

I tried all of these suggested fixes: decreased taskmanager.memory.fraction
to give more memory to user-managed operations, increased the number of
JVMs (parallelism), and used the G1 GC for better GC performance, but my job
still errors out.

I increased akka.watch.heartbeat.pause, akka.watch.threshold, and
akka.watch.heartbeat.interval to prevent the timeout due to GC, but this
doesn't help either. I figured that with really high values for death watch,
the program would run very slowly and complete at some point, but it fails
anyway.

I'm now trying to decrease object creation in my program, but so far it
hasn't helped.

How can I go about debugging and fixing this problem?

Thank you.



