MaxMetaspace default may be too low?


John Smith
Hi, I just upgraded to 1.10 and started deploying my jobs. Eventually the task nodes started shutting down with OutOfMemoryError: Metaspace.

I looked at the logs and the task managers are started with -XX:MaxMetaspaceSize=100663296 (96 MiB).

So I configured: taskmanager.memory.jvm-metaspace.size: 256m

It seems to be OK for now. What are your thoughts? Should I try 512m, or is that too much?
Re: MaxMetaspace default may be too low?

Xintong Song
Hi John,

The default metaspace size is intended to work for the majority of jobs. We are aware that for some jobs that need to load lots of classes, the default value might not be large enough. However, a larger default would mean that, for jobs that do not load many classes, the overall memory requirement becomes unnecessarily high. (Imagine a task manager with the default total memory of 1.5 GB, where 512 MB of it is reserved for metaspace.)

Another possible problem is a metaspace leak. When you say "eventually the task nodes started shutting down with OutOfMemoryError: Metaspace", does this happen shortly after job execution starts, or after the job has been running for a while? Does the metaspace footprint keep growing, or does it become stable after the initial growth? If the metaspace keeps growing over time, that is usually an indicator of a metaspace memory leak.
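
If you want to observe the footprint from inside the task manager JVM, here is a minimal sketch (plain java.lang.management, not a Flink API; it assumes a HotSpot JVM, where the pool is named "Metaspace") that you could call periodically, for example from a RichFunction:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;

    public class MetaspaceLogger {
        // Logs the current metaspace usage of the JVM this code runs in.
        public static void logMetaspace() {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if ("Metaspace".equals(pool.getName())) {
                    long usedMiB = pool.getUsage().getUsed() >> 20;
                    long maxMiB = pool.getUsage().getMax() >> 20; // prints -1 when no MaxMetaspaceSize is set
                    System.out.println("Metaspace used: " + usedMiB + " MiB, max: " + maxMiB + " MiB");
                }
            }
        }
    }

If the "used" value keeps climbing while your jobs are in a steady state, that points towards a leak rather than a too-small default.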

Thank you~

Xintong Song

Re: MaxMetaspace default may be too low?

John Smith
Right after job execution, basically as soon as I deployed a 5th job. At 4 jobs it was OK; at 5 jobs it would take 1-2 minutes at most and then the node would just shut off.
So far, with MaxMetaspaceSize at 256m, it's been stable. My task nodes have 16GB, and the memory config is as follows:
taskmanager.memory.flink.size: 12g
taskmanager.memory.jvm-metaspace.size: 256m
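
(With that setting, the task managers should now be launched with -XX:MaxMetaspaceSize=268435456, i.e. 256 MiB.)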

100% of the jobs right now are ETL with checkpoints and NO state:
Kafka -----> JSON transform -----> DB
or
Kafka -----> DB lookup (to a small local cache) -----> JSON transform -----> Apache Ignite

None of the jobs are related.
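
To give a sense of the shape, each job looks roughly like this simplified sketch (topic, group, and sink names are made up; the real sinks are the JDBC/Ignite writers):

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class KafkaToDbJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000); // checkpointing is on, but the job keeps no keyed/operator state of its own

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092");
            props.setProperty("group.id", "etl-job");

            env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
                .map(KafkaToDbJob::transformJson)
                .addSink(new DbSink());

            env.execute("kafka-to-db");
        }

        // JSON transform step (details omitted)
        private static String transformJson(String json) {
            return json;
        }

        // Stand-in for the actual JDBC / Ignite sink
        private static class DbSink implements SinkFunction<String> {
            @Override
            public void invoke(String value, Context context) {
                // write the record to the database here
            }
        }
    }

Nothing is keyed and there are no windows, so the only state is whatever the Kafka consumer checkpoints for its offsets.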

Re: MaxMetaspace default may be too low?

John Smith
I would also like to add that the exact same jobs were running perfectly fine on Flink 1.8.

Re: MaxMetaspace default may be too low?

Xintong Song
In that case, I think the default metaspace size is too small for your setup. The default configuration is not intended for such large task managers.

In Flink 1.8 we did not set the JVM '-XX:MaxMetaspaceSize' parameter, which means you had an 'unlimited' metaspace size. We changed that in Flink 1.10 to have stricter control over the overall memory usage of Flink processes.
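
If you ever want to double-check which arguments a running task manager JVM was actually started with (besides reading the logs), a small sketch using the standard RuntimeMXBean, executed inside that JVM (e.g. from a UDF), would be:

    import java.lang.management.ManagementFactory;

    public class PrintJvmArgs {
        // Prints the startup arguments of the JVM this code runs in; on a 1.10 task manager
        // you should see -XX:MaxMetaspaceSize among them, on 1.8 you should not.
        public static void printArgs() {
            ManagementFactory.getRuntimeMXBean().getInputArguments()
                    .forEach(System.out::println);
        }
    }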

Thank you~

Xintong Song

Re: MaxMetaspace default may be too low?

John Smith
OK, maybe it can be documented?

So, just trying to understand: how do most people run their jobs? Do they run fewer tasks, but tasks that use a lot of direct or mapped memory? Like a small JVM heap but huge state outside the JVM?

I also filed this issue so we can maybe get it documented: https://issues.apache.org/jira/browse/FLINK-16278

Re: MaxMetaspace default may be too low?

Xintong Song
I'm sorry that you had a bad experience with the migration and configuration. I believe the change to limit the metaspace size is already documented in various places, but maybe it's not prominent enough, which led to your confusion. Let's continue the discussion on how to improve that in the JIRA ticket you opened.

Regarding how most people run their jobs, it depends on various factors and is hard to describe in general. Narrowing it down to the metaspace footprint, it really depends on how many classes are loaded, i.e., how many libraries are used, how many classes are defined in UDFs, and how many tasks from different jobs co-exist in the same TM process.

According to our testing before the release, the current default value works for all the e2e tests and for our test jobs with simple UDFs (without custom libraries) in single-job clusters. We did observe problems with large multi-slot TMs concurrently running different jobs. However, such setups usually require changing various configuration options (process.size/flink.size, number of slots, etc.), and we think it makes sense for metaspace to be one of them.

Thank you~

Xintong Song