Native memory allocation (mmap) failed to map 1006567424 bytes

Native memory allocation (mmap) failed to map 1006567424 bytes

orips
After the job has been running for about 10 days in production, TaskManagers start failing with:

Connection unexpectedly closed by remote task manager

Looking in the machine logs, I can see the following error:

============= Java processes for user hadoop =============
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fb4f4010000, 1006567424, 0) failed; error='Cannot allocate memory' (err
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 1006567424 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /mnt/tmp/hsperfdata_hadoop/hs_err_pid6585.log
=========== End java processes for user hadoop ===========

In addition, the metrics for the TaskManager show very low Heap memory consumption (20% of Xmx).

Hence, I suspect there is a memory leak in the TaskManager's Managed Memory.

This is my TaskManager's memory breakdown:
flink process 112g
framework.heap.size 0.2g
task.heap.size 50g
managed.size 54g
framework.off-heap.size 0.5g
task.off-heap.size 1g
network 2g
XX:MaxMetaspaceSize 1g

As you can see, the managed memory is 54g, so it's already high (my managed.fraction is set to 0.5).
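For reference, these line items map onto the standard Flink 1.10 taskmanager.memory.* option names as follows (these are effective sizes, not necessarily what is set explicitly - e.g. the managed memory here comes from managed.fraction: 0.5, and the 0.2g/0.5g entries are written out as 200m/512m):

taskmanager.memory.process.size: 112g
taskmanager.memory.framework.heap.size: 200m
taskmanager.memory.task.heap.size: 50g
taskmanager.memory.managed.size: 54g
taskmanager.memory.framework.off-heap.size: 512m
taskmanager.memory.task.off-heap.size: 1g
taskmanager.memory.network.min: 2g
taskmanager.memory.network.max: 2g
taskmanager.memory.jvm-metaspace.size: 1g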

I'm running Flink 1.10. Full job details attached.

Can someone advise what would cause a managed memory leak?



Attachment: job-details.txt (1K)

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

Xintong Song
Hi Ori,

The error message suggests that there's not enough physical memory on the machine to satisfy the allocation. This does not necessarily mean a managed memory leak; a leak is only one of the possibilities. There are other potential causes, e.g., another process/container on the machine using more memory than expected, or the Yarn NM not being configured with enough memory reserved for system processes.

I would suggest first looking into the machine's memory usage to see whether the Flink process indeed uses more memory than expected. This could be done by:
- Running the `top` command
- Looking into the `/proc/meminfo` file
- Checking any container memory usage metrics available in your Yarn cluster
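
For example, on the affected machine (plain Linux commands, nothing Flink-specific; just a sketch of the checks above):

top -b -n 1 | head -20                                    # overall memory picture and largest processes
grep -E 'MemTotal|MemFree|MemAvailable|Committed_AS' /proc/meminfo
ps auxwww --sort -rss | head -10                          # processes sorted by resident memory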

Thank you~

Xintong Song



On Tue, Oct 27, 2020 at 6:21 PM Ori Popowski <[hidden email]> wrote:

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

orips
Hi Xintong,

See here:

# Top memory users
ps auxwww --sort -rss | head -10
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
yarn     20339 35.8 97.0 128600192 126672256 ? Sl   Oct15 5975:47 /etc/alternatives/jre/bin/java -Xmx54760833024 -Xms54760833024 -XX:Max
root      5245  0.1  0.4 5580484 627436 ?      Sl   Jul30 144:39 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
hadoop    5252  0.1  0.4 7376768 604772 ?      Sl   Jul30 153:22 /etc/alternatives/jre/bin/java -Xmx1024m -XX:+ExitOnOutOfMemoryError -X
yarn     26857  0.3  0.2 4214784 341464 ?      Sl   Sep17 198:43 /etc/alternatives/jre/bin/java -Dproc_nodemanager -Xmx2048m -XX:OnOutOf
root      5519  0.0  0.2 5658624 269344 ?      Sl   Jul30  45:21 /usr/bin/java -Xmx1500m -Xms300m -XX:+ExitOnOutOfMemoryError -XX:MinHea
root      1781  0.0  0.0 172644  8096 ?        Ss   Jul30   2:06 /usr/lib/systemd/systemd-journald
root      4801  0.0  0.0 2690260 4776 ?        Ssl  Jul30   4:42 /usr/bin/amazon-ssm-agent
root      6566  0.0  0.0 164672  4116 ?        R    00:30   0:00 ps auxwww --sort -rss
root      6532  0.0  0.0 183124  3592 ?        S    00:30   0:00 /usr/sbin/CROND -n

On Wed, Oct 28, 2020 at 11:34 AM Xintong Song <[hidden email]> wrote:

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

Xintong Song
Hi Ori,

It looks like Flink indeed uses more memory than expected. I assume the first item, with PID 20339, is the Flink process, right?

It would be helpful if you could briefly describe your workload.
- What kind of workload are you running? Streaming or batch?
- Do you use the RocksDB state backend?
- Any UDFs or 3rd-party dependencies that might allocate significant native memory?

Moreover, if the metrics show only 20% heap usage, I would suggest configuring a smaller `task.heap.size`, leaving more memory off-heap. The reduced heap size does not necessarily all have to go to managed memory; you can also try increasing the `jvm-overhead`, simply to leave more native memory in the container in case there are other significant native memory usages.
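
For example (illustrative values only; the option names are the standard Flink 1.10 taskmanager.memory.* keys):

taskmanager.memory.task.heap.size: 30g
taskmanager.memory.jvm-overhead.max: 4g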

Thank you~

Xintong Song



On Wed, Oct 28, 2020 at 5:53 PM Ori Popowski <[hidden email]> wrote:

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

orips
Hi,

PID 20339 is indeed the Flink process, specifically the TaskManager process.

- The workload is a streaming job reading from Kafka and writing to S3 using a custom sink
- The RocksDB state backend is used with default settings
- My external dependencies are:
-- logback
-- jackson
-- flatbuffers
-- jaxb-api
-- scala-java8-compat
-- apache commons-io
-- apache commons-compress
-- software.amazon.awssdk s3
- What do you mean by UDFs? I've implemented several operators like KafkaDeserializationSchema, FlatMap, Map, ProcessFunction.

We use a session window with a 30-minute gap and a watermark with a 10-minute delay.

We did confirm that we have some keys in our job which keep receiving records indefinitely, but I'm not sure why that would cause a managed memory leak, since this state should be flushed to RocksDB, freeing the memory used. We have a guard against this: we track the overall size of all the records for each key, and when it reaches 300 MB we stop passing that key's records downstream, which causes them to form a session and go through the sink.

About what you suggested - I sort of did this already by increasing the managed memory fraction to 0.5, and it did postpone the problem (the TMs started crashing after 10 days instead of 7). It looks like anything I do on that front will only postpone the problem, not solve it.

I am attaching the full job configuration.



On Thu, Oct 29, 2020 at 10:09 AM Xintong Song <[hidden email]> wrote:

Attachment: configuration.txt (2K)

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

Xintong Song
Hi Ori,

RocksDB also uses managed memory. If the memory overuse indeed comes from RocksDB, then increasing the managed memory fraction will not help: RocksDB will try to use as much memory as the configured managed memory size, so a larger managed memory fraction just makes RocksDB try to use more memory. That is why I suggested increasing `jvm-overhead` instead.

Please also make sure the configuration option `state.backend.rocksdb.memory.managed` is either not explicitly configured, or configured to `true`.
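
For reference, assuming the state backend is configured in flink-conf.yaml rather than in code, the relevant lines would simply be (the option defaults to true in 1.10, so leaving it unset is equivalent):

state.backend: rocksdb
state.backend.rocksdb.memory.managed: true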

In addition, I noticed that you are using Flink 1.10.0. You might want to upgrade to 1.10.2 to pick up the latest bug fixes in the 1.10 release.

Thank you~

Xintong Song



On Thu, Oct 29, 2020 at 4:41 PM Ori Popowski <[hidden email]> wrote:

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

orips

Hi Xintong,

Unfortunately I cannot upgrade to 1.10.2, because EMR offers only 1.10.0 or 1.11.0.

About the overhead - it turns out I had already configured taskmanager.memory.jvm-overhead.max to 2 GB instead of the default 1 GB. Should I increase it further?

state.backend.rocksdb.memory.managed is indeed not explicitly configured.

Is there anything else I can do?



On Thu, Oct 29, 2020 at 1:24 PM Xintong Song <[hidden email]> wrote:

Re: Native memory allocation (mmap) failed to map 1006567424 bytes

Xintong Song
Hi Ori,

I'm not sure where the problem comes from, but there are several things that might be worth a try.
- Further increasing the `jvm-overhead`. Your `ps` result suggests that the Flink process uses 120+ GB, while `process.size` is configured to 112 GB, so 2 GB of `jvm-overhead` might not be enough. I would suggest tuning `managed.fraction` back to 0.4 and increasing `jvm-overhead` to around 12 GB. This should give you roughly the same `process.size` as before while leaving more unmanaged native memory space (a sketch of such a configuration follows after the link below).
- During the 7-10 days the job runs, are there any failovers/restarts? If yes, you might want to look into this comment [1] in FLINK-18712.
- If neither of the above helps, we might need to use tools (e.g., JVM NMT [2]) to track native memory usage and see where exactly the leak comes from (a sketch of the NMT workflow is also included below).

[2] https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
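
To illustrate the first point, a rough sketch of such a configuration (standard Flink 1.10 memory options; the values are only an example that keeps process.size unchanged):

taskmanager.memory.process.size: 112g
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 12g
taskmanager.memory.jvm-overhead.max: 12g

And a rough sketch of the NMT workflow (NMT adds some overhead, so it is meant for debugging only; env.java.opts.taskmanager is the standard way to pass extra JVM flags to the TaskManagers, and <tm-pid> is a placeholder for the TaskManager's PID):

# in flink-conf.yaml, before restarting the job:
env.java.opts.taskmanager: -XX:NativeMemoryTracking=summary
# then, on the affected machine:
jcmd <tm-pid> VM.native_memory baseline
jcmd <tm-pid> VM.native_memory summary.diff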

On Thu, Oct 29, 2020 at 7:51 PM Ori Popowski <[hidden email]> wrote:


Re: Native memory allocation (mmap) failed to map 1006567424 bytes

orips
- I will increase the jvm-overhead
- I don't have any failovers or restarts before the problem starts happening
- If it happens again even with the changes, I'll post the NMT output

On Fri, Oct 30, 2020 at 3:54 AM Xintong Song <[hidden email]> wrote: