Flink memory leak

ebru
Hi,

We are using Flink 1.3.1 in production, with one job manager and 3
task managers in standalone mode. Recently we have noticed
memory-related problems. The Flink cluster runs in Docker containers.
We have 300 slots, and 20 jobs are running with a parallelism of 10;
the job count may change over time. Task manager memory usage keeps
increasing, and it does not decrease after a job is cancelled. To
investigate, we took a heap snapshot of a task manager JVM. According
to the JVM heap analysis, the possible memory leak is the Flink list
state descriptor, but we are not sure that this is the cause of our
memory problem. How can we solve the problem?

Re: Flink memory leak

Ufuk Celebi
Hey Ebru,

let me pull in Aljoscha (CC'd) who might have an idea what's causing this.

Since multiple jobs are running, it will be hard to tell which job the
state descriptors in the heap snapshot belong to. Is it possible to
isolate the problem and reproduce the behaviour with only a single job?

– Ufuk


On Tue, Nov 7, 2017 at 10:27 AM, ÇETİNKAYA EBRU
<[hidden email]> wrote:

> Hi,
>
> We are using Flink 1.3.1 in production, with one job manager and 3 task
> managers in standalone mode. Recently we have noticed memory-related
> problems. The Flink cluster runs in Docker containers. We have 300 slots,
> and 20 jobs are running with a parallelism of 10; the job count may change
> over time. Task manager memory usage keeps increasing, and it does not
> decrease after a job is cancelled. To investigate, we took a heap snapshot
> of a task manager JVM. According to the JVM heap analysis, the possible
> memory leak is the Flink list state descriptor, but we are not sure that
> this is the cause of our memory problem. How can we solve the problem?