OOM on flink when job restarts a lot

Frits Jalvingh
Hello List,

We have a Flink job that reads a Kafka topic and sends every message on through a SOAP call. We recently had a situation where that SOAP call failed every time, causing the job to go into RESTARTING every few seconds.

After a few hours of this, Flink itself terminates with an OutOfMemoryError, which means that all other Flink jobs are in trouble as well.

I dumped the heap, and noticed that it was completely filled up with two things:
- Kafka metrics
- HashMap nodes related to PublicSuffixMatcher, a part of Apache HttpClient.

This leads me to believe that each restart somehow retains references to the classes or classloaders of the previous, failed attempt. Is that what is happening?

Of course I will repair the root cause, the failing job, but I would also like to fix things so that Flink does not die when something like this happens. I could of course set a maximum number of retries, but I do not like that: I would rather have the job retry indefinitely, so that it continues normally once the underlying problem is repaired.
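One option I am considering is to retry the SOAP call inside the job itself, so that a failing endpoint does not trigger a job restart at all. The following is only a minimal sketch of that idea: `RetryingCall` and `untilSuccess` are hypothetical names of my own, not part of Flink or HttpClient, and the delay values are arbitrary.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not a Flink API): retries a call in place with capped exponential backoff. */
public final class RetryingCall {

    private RetryingCall() {}

    /**
     * Invokes the call until it succeeds, sleeping between failed attempts.
     * The delay doubles after every failure, up to maxDelayMs.
     * InterruptedException is propagated so that a cancelled task can still stop.
     */
    public static <T> T untilSuccess(Callable<T> call, long initialDelayMs, long maxDelayMs)
            throws InterruptedException {
        long delay = initialDelayMs;
        while (true) {
            try {
                return call.call();
            } catch (InterruptedException e) {
                // Do not swallow interruption: the task is probably being cancelled.
                throw e;
            } catch (Exception e) {
                // Any other failure: wait, then try again with a longer delay.
                Thread.sleep(delay);
                delay = Math.min(delay * 2, maxDelayMs);
            }
        }
    }
}
```

The sink would then wrap its SOAP call in something like `RetryingCall.untilSuccess(() -> sendSoap(message), 1000L, 60000L)`, where `sendSoap` stands for whatever method performs the call. The obvious trade-off is that the task blocks while the endpoint is down, so the job stops consuming from Kafka during that time instead of restarting.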

I tried to find information about how Flink loads jobs but I could not make much of it.

How can I ensure that Flink does not run out of memory like this?

We're using Flink 1.1.1 and Kafka 0.9.0.1.

Thanks for your time,

Frits