Flink job consuming all available memory on host


Mitch Lloyd
We are having an issue with a Flink job that gradually consumes all available memory on its Docker host, eventually crashing the machine.

* We are running Flink 1.10.0
* We are running Flink in a Docker container on AWS ECS with EC2 instances
* The Flink task manager UI does not show high memory usage
* Our job uses a custom process window function and RocksDB to track state
* Our memory configuration has:
    * taskmanager.memory.process.size: 6g
    * taskmanager.memory.jvm-metaspace.size: 256m
* The host machine has more memory than we allocate for task instances
* We've tried using a machine with much more memory than was allocated to the Job Managers, and memory still leaked until the machine's resources were exhausted.
* The memory on the host is never recovered, even after all tasks have stopped.

We saw this problem occur over the course of about a week when we originally launched this streaming job. We've now added a daily backfill job that uses the same windowing function and exhausts the host memory much faster (within a few hours).

What might cause this continual increase of memory usage on the host machine?

Re: Flink job consuming all available memory on host

Xintong Song
Hi Mitch,

Have you configured 'state.backend.rocksdb.memory.managed'? The default is 'true'; if you have set it to 'false', the RocksDB memory footprint is no longer bounded by Flink's managed memory and can grow beyond the configured task manager memory size.
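
For reference, in flink-conf.yaml this would look like the following (shown only for illustration; 'true' is already the default in 1.10):

    state.backend.rocksdb.memory.managed: true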

Besides, do your UDFs by any chance use any native memory? E.g., launching another process, calling into a JNI library, or something similar?
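
For example, here is a purely hypothetical sketch (not your code; the function and the external tool name are made up) of the kind of window function that leaks memory at the host level even though the JVM heap and the task manager UI look healthy, because every invocation starts a native process that is never reaped:

    import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    // Hypothetical example for illustration only.
    public class LeakyWindowFunction
            extends ProcessWindowFunction<String, String, String, TimeWindow> {

        @Override
        public void process(String key,
                            Context ctx,
                            Iterable<String> elements,
                            Collector<String> out) throws Exception {
            // Starts an external process per window but never calls waitFor()
            // or destroy() and never closes its streams, so child processes and
            // their buffers pile up on the host outside the JVM.
            Process leaked = new ProcessBuilder("some-native-tool", key).start();
            out.collect(key);
        }

        // Releasing such native resources in close(), or not allocating them per
        // window at all, is what keeps the host memory bounded.
    }

If your functions do anything along these lines (native libraries, direct ByteBuffers, child processes), that memory only shows up at the host level, not in Flink's metrics.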

Thank you~

Xintong Song


