http://deprecated-apache-flink-user-mailing-list-archive.369.s1.nabble.com/Out-off-memory-when-catching-up-tp19108p19228.html
Hi
For sure I can share more info. We run Flink 1.4.2 (but we had the same problems on 1.3.2) on an AWS EMR cluster, with 6 task managers on each m4.xlarge slave and the task manager heap set to 1850 MB. We use the RocksDB state backend. We have set akka.ask.timeout to 60 s in case GC pauses delay heartbeats, and yarn.maximum-failed-containers to 10000 to give us some buffer before we lose our YARN session.
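For reference, the relevant part of our flink-conf.yaml looks roughly like this (everything else is left at the defaults):

    taskmanager.heap.mb: 1850
    state.backend: rocksdb
    akka.ask.timeout: 60 s
    yarn.maximum-failed-containers: 10000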
One of our jobs reads data from Kinesis as a JSON string and maps it into an object. Then we do some enrichment in a CoProcessFunction; if we can't find the data in the co-process state, we do a lookup through an AsyncDataStream. We then merge the two streams so that we have one stream where enrichment has taken place. After that we parse the binary data, create new objects, and output one main stream and 4 side-output streams; this map function should be 1-to-1 in the number of objects.
For some of the side-output streams we do additional enrichment before all 5 streams are written to Kinesis.
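To make the pipeline concrete, here is a simplified sketch of the wiring. The class names (RawEvent, Reference, Parsed, EnrichCoProcess, LookupAsyncFunction, ParseFunction, ParsedSchema), the tags, and objectMapper are placeholders, not our real code, and imports/POJOs are omitted:

    // Kinesis source: JSON strings mapped to objects
    DataStream<RawEvent> events = env
        .addSource(new FlinkKinesisConsumer<>("input-stream", new SimpleStringSchema(), consumerConfig))
        .map(json -> objectMapper.readValue(json, RawEvent.class));

    // enrichment via a CoProcessFunction over a keyed reference stream;
    // records we cannot enrich from state go to a "misses" side output
    SingleOutputStreamOperator<RawEvent> enriched = events
        .connect(referenceStream)
        .keyBy(RawEvent::getKey, Reference::getKey)
        .process(new EnrichCoProcess(MISSES));

    // misses are enriched through an async lookup instead
    DataStream<RawEvent> viaLookup = AsyncDataStream.unorderedWait(
        enriched.getSideOutput(MISSES), new LookupAsyncFunction(), 60, TimeUnit.SECONDS);

    // merge back into one fully enriched stream
    DataStream<RawEvent> merged = enriched.union(viaLookup);

    // parse the binary payload: one main stream plus 4 side outputs, 1-to-1 per input
    SingleOutputStreamOperator<Parsed> main = merged.process(new ParseFunction());
    DataStream<Parsed> side1 = main.getSideOutput(SIDE_1);
    // ... side2-side4 the same; some get extra enrichment, then all 5 sink to Kinesis
    main.addSink(new FlinkKinesisProducer<>(new ParsedSchema(), producerConfig));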
I have now implemented a maximum number of records read from Kinesis per fetch, and by doing that I can avoid losing my task managers, but now I can't catch up as fast as I would like. I have only seen back pressure once, and that was for another job that uses iteration; it never returned from that state.
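What I did is roughly equivalent to tightening the consumer's fetch settings, something along these lines (our actual implementation is custom, and the region/values are just examples):

    Properties consumerConfig = new Properties();
    consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "eu-west-1");
    // read fewer records per GetRecords call so a large backlog can't flood the job
    consumerConfig.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_MAX, "1000");
    // and wait longer between fetches
    consumerConfig.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS, "1000");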
So yes, we create objects. I would guess around 10-20 objects for each input object, and I would like to understand what is going on so I can make an implementation that takes care of it.
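One direction I am considering (untested, and RawEvent/Enriched below are made-up names) is to reuse a single mutable output object per operator instead of allocating new ones per record, together with Flink's object reuse mode:

    // opt in to object reuse so Flink doesn't defensively copy records between
    // chained operators; only safe when downstream operators don't hold on to references
    env.getConfig().enableObjectReuse();

    public class ReusingMapper extends RichMapFunction<RawEvent, Enriched> {
        private transient Enriched reuse;  // one instance, overwritten per record

        @Override
        public void open(Configuration parameters) {
            reuse = new Enriched();
        }

        @Override
        public Enriched map(RawEvent event) {
            reuse.setFrom(event);  // hypothetical setter that overwrites fields in place
            return reuse;
        }
    }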
But is there a way to configure Flink so it spills to disk instead of hitting OOM? I would prefer a slow system over a dead one.
Please let me know if you need additional information or if anything doesn't make sense.
Lasse Nedergaard