Cluster type: Standalone cluster
Job type: Streaming
JVM memory: 26.2 GB
Pod memory: 33 GB
CPU: 10 cores
GC: G1GC
Flink version: 1.8.3
State backend: file based
NETWORK_BUFFERS_MEMORY_FRACTION: 0.02f of the heap
We are not accessing direct memory from the application; only Flink uses direct memory.

We notice that in Flink 1.8.3 the pod is killed with OOM over a period of 30 minutes, while the JVM heap stays within its limit. We read from Kafka and have windows in the application. Our sink is either Kafka or Elasticsearch. The same application/job was working perfectly in Flink 1.4.1 with the same input and output rates, and there is no back pressure. I have attached a few Grafana charts as a PDF (memory_issue.pdf).

Any idea why the off-heap memory / memory outside the JVM keeps growing and eventually reaches the limit?

Native memory tracking summary:

    Java Heap (reserved=26845184KB, committed=26845184KB)
              (mmap: reserved=26845184KB, committed=26845184KB)
    - Class (reserved=1241866KB, committed=219686KB)
            (classes #36599)
            (malloc=4874KB #74568)
            (mmap: reserved=1236992KB, committed=214812KB)
    - Thread (reserved=394394KB, committed=394394KB)
             (thread #383)
             (stack: reserved=392696KB, committed=392696KB)
             (malloc=1250KB #1920)
             (arena=448KB #764)
    - Code (reserved=272178KB, committed=137954KB)
           (malloc=22578KB #33442)
           (mmap: reserved=249600KB, committed=115376KB)
    - GC (reserved=1365088KB, committed=1365088KB)
         (malloc=336112KB #1130298)
         (mmap: reserved=1028976KB, committed=1028976KB)

Thanks,
Josson
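The breakdown above matches the JVM's Native Memory Tracking summary format. For readers who want to collect the same view, a minimal sketch, assuming the flag is passed via env.java.opts.taskmanager in flink-conf.yaml (there are other ways to set JVM options, and <tm-pid> is a placeholder for the TaskManager's process id):

    # flink-conf.yaml: enable native memory tracking on TaskManager JVMs (adds some runtime overhead)
    env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary

    # On the TaskManager host/pod, query the running JVM:
    jcmd <tm-pid> VM.native_memory summary

The summary groups native allocations by JVM subsystem (heap, class, thread, code, GC, ...), which is what the reserved/committed figures quoted above show.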
Hi Josson,

I don't have much experience setting memory bounds in Kubernetes myself, but my colleague Andrey (in CC) reworked Flink's memory configuration for the last release to ease configuration in container environments. He might be able to help.

Best,
Fabian

On Thu, May 21, 2020 at 6:43 PM Josson Paul <[hidden email]> wrote:
Hi Josson,

Do you use a state backend? Is it RocksDB?

Best,
Andrey

On Fri, May 22, 2020 at 12:58 PM Fabian Hueske <[hidden email]> wrote:
Hi Andrey,

We don't use RocksDB. As I said in the original email, I am using the file-based backend. Even though our cluster runs on Kubernetes, our Flink cluster uses Flink's standalone resource manager; we have not yet integrated Flink with Kubernetes.

Thanks,
Josson

On Fri, May 22, 2020 at 3:37 AM Andrey Zagrebin <[hidden email]> wrote:
Hi Andrey,

To clarify my earlier email: I am using the heap-based state backend, not RocksDB.

Thanks,
Josson

On Sat, May 23, 2020, 17:37 Josson Paul <[hidden email]> wrote:
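For readers following along, a minimal sketch of what a heap-based, file-checkpointing setup typically looks like in the Flink 1.8 API. The class name, checkpoint path, interval, and the stand-in pipeline are illustrative placeholders, not the poster's actual job:

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HeapStateBackendExample {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // FsStateBackend keeps working state as objects on the JVM heap and only
            // writes checkpoints/savepoints to the configured path (placeholder below).
            env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));
            env.enableCheckpointing(60_000); // checkpoint every 60 s

            // Stand-in pipeline so the sketch runs on its own; the real job reads from
            // Kafka, applies windows, and sinks to Kafka or Elasticsearch.
            env.fromElements(1, 2, 3).print();

            env.execute("heap-state-backend-sketch");
        }
    }

With this backend, state size is bounded by the JVM heap, which is consistent with the observation that the heap itself stays within its limit while something outside the heap grows.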
Hi Josson,
Thanks for the details. Sorry, I overlooked that you indeed mentioned the file backend.

Looking into Flink's memory model [1], I do not notice any problems related to the types of memory consumption we model in Flink. Direct memory consumption by the network stack corresponds to your configured fraction (0.02f), and the JVM heap cannot cause the problem. I do not know of any other types of memory consumption in Flink 1.8.

Nonetheless, there is no way to control all types of memory consumption, especially native memory allocated either by user code or by the JVM (if you do not use RocksDB, Flink barely uses native memory explicitly). Examples (not exhaustive):

- native libraries in user code or its dependencies which use off-heap memory, e.g. via malloc (detecting this would require an OS process dump)
- JVM metaspace, thread stacks, GC overhead, etc. (we do not limit any of these in 1.8 via JVM args)

Recently, we discovered some class loading leaks (JVM metaspace), e.g. [2] or [3]. Since 1.10, Flink limits JVM metaspace and direct memory, so you would get a concrete OOM exception before the container dies. Maybe the Kafka or Elasticsearch connector clients were updated with 1.8 and caused some leak; I cc'ed Gordon and Piotr in case they have an idea.

I suggest trying to decrease the pod memory, noting the consumed memory of the various types at the moment the container dies (as I suppose you did), and then increasing the pod memory several times until you see which type of memory consumption always grows until the OOM while the other types hopefully stabilise at some level. Then you could take a dump of that ever-growing type of memory consumption to analyse whether there is a memory leak.

Best,
Andrey

[3] https://issues.apache.org/jira/browse/FLINK-11205
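For context, a rough sketch of the Flink 1.10+ limits Andrey refers to, using the standard flink-conf.yaml keys introduced in 1.10; the sizes below are illustrative placeholders, not tuning advice:

    # Overall memory budget of the TaskManager process (heap + off-heap + metaspace + overhead).
    taskmanager.memory.process.size: 33g

    # Explicit metaspace cap; exceeding it raises OutOfMemoryError: Metaspace
    # instead of the pod being killed by the OS.
    taskmanager.memory.jvm-metaspace.size: 256m

    # Budget for direct/native memory used by user code (counts toward -XX:MaxDirectMemorySize).
    taskmanager.memory.task.off-heap.size: 128m

    # Fraction of Flink memory reserved for network buffers.
    taskmanager.memory.network.fraction: 0.02

The practical effect is that overconsumption surfaces as a specific JVM OutOfMemoryError rather than an opaque container OOM kill, which makes the kind of leak hunt described above much easier.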