We are having an issue with a Flink job that gradually consumes all available memory on the Docker host machine, eventually crashing the host.
* We are running Flink 1.10.0
* We are running Flink in a Docker container on AWS ECS with EC2 instances
* The Flink TaskManager UI does not show high memory usage
* Our job uses a custom `ProcessWindowFunction` and RocksDB to track state (a simplified sketch of the pipeline follows this list)
* Our memory configuration has:
  * `taskmanager.memory.process.size: 6g`
  * `taskmanager.memory.jvm-metaspace.size: 256m`
* The host machine has more memory than we allocate to the ECS task instances
* We've tried a machine with much more memory than was allocated to the JobManagers, and memory still leaked until the machine's resources were exhausted.
* The memory on the host is never recovered, even after all tasks have stopped.
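
For context, here is a minimal sketch of the shape of the pipeline, not our actual code: the source, event type, key field, window size, checkpoint URI, and window-function body are all placeholders. What is faithful to our setup is the combination of a keyed time window, a custom `ProcessWindowFunction`, and the RocksDB state backend on Flink 1.10.

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WindowedJobSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB keeps keyed/windowed state off-heap; incremental checkpoints enabled.
        // The checkpoint URI is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));

        // Placeholder source; the real job reads from an external system.
        DataStream<Event> events = env.fromElements(new Event("a"), new Event("b"));

        events
            .keyBy(new KeySelector<Event, String>() {
                @Override
                public String getKey(Event e) {
                    return e.key;
                }
            })
            .timeWindow(Time.minutes(10))           // placeholder window size
            .process(new CountingWindowFunction())  // stand-in for our custom function
            .print();

        env.execute("windowed-job-sketch");
    }

    /** Stand-in for our custom process window function; the real one is more involved. */
    public static class CountingWindowFunction
            extends ProcessWindowFunction<Event, String, String, TimeWindow> {
        @Override
        public void process(String key, Context ctx, Iterable<Event> elements, Collector<String> out) {
            long count = 0;
            for (Event ignored : elements) {
                count++;
            }
            out.collect(key + ": " + count + " events in window ending at " + ctx.window().getEnd());
        }
    }

    /** Placeholder event type (a POJO so Flink can serialize it). */
    public static class Event {
        public String key;

        public Event() {}

        public Event(String key) {
            this.key = key;
        }
    }
}
```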
When we originally launched this streaming job, the problem developed over the course of about a week. We've since added a daily backfill job that uses the same windowing function, and it exhausts the host's memory much faster (within a few hours).
What might cause this continual increase in memory usage on the host machine?