Hello List,
We have a Flink job that reads a Kafka topic and forwards every message via a SOAP call. We hit a situation where that SOAP call failed every time, causing the job to go into RESTARTING every few seconds.
After a few hours Flink itself terminated with an OutOfMemoryError, which means that all Flink jobs on the cluster are now in trouble.
I dumped the heap, and noticed that it was completely filled up with two things:
- kafka metrics
- HashMap nodes related to PublicSuffixMatcher, a part of Apache HttpClient.
This leads me to believe that the restarts somehow retain references to classes/classloaders from the old failed attempts?
Of course I will repair the root cause, the failing job, but I would also like to fix things so that Flink does not die when something like this happens. I could set something like a maximum number of restart attempts, but I don't like that option: I'd rather have the job retry indefinitely, so that once the root cause is repaired the job continues normally.
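For context, this is roughly the restart configuration I have in mind (a sketch using the fixed-delay restart-strategy keys from the Flink 1.1 docs; the attempt count and delay values are just illustrative):

```yaml
# flink-conf.yaml sketch: fixed-delay restart strategy (Flink 1.1 config keys).
# A very large attempt count approximates "retry indefinitely".
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 10 s
```

But with each restart apparently leaking classloader references, a huge attempt count just delays the OutOfMemoryError rather than preventing it.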
I tried to find information about how Flink loads job code, but I could not make much of it.
How can I ensure that Flink does not run out of memory like this?
We're using Flink 1.1.1 and Kafka 0.9.0.1.
Thanks for your time,
Frits