[Survey] Default size for the new JVM Metaspace limit in 1.10

[Survey] Default size for the new JVM Metaspace limit in 1.10

Andrey Zagrebin-5
Hi All,

Recently, FLIP-49 [1] introduced a new JVM Metaspace limit in the 1.10 release [2]. The Flink scripts that start the task manager JVM process set this limit by adding the corresponding JVM argument. This has been done to plan resources properly, especially to derive the container size for Yarn/Mesos/Kubernetes. It should also surface potential class loading leaks. There is an option to change it: 'taskmanager.memory.jvm-metaspace.size' [3]. Its current default value is 96Mb.
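
For illustration, a minimal sketch of how this can be configured (as far as I know, the scripts translate the option into the standard -XX:MaxMetaspaceSize JVM argument; the value below is only an example):

    # flink-conf.yaml
    taskmanager.memory.jvm-metaspace.size: 96m   # current default; raise it if the job needs more class metadata

so the task manager JVM would be started with something like '-XX:MaxMetaspaceSize=96m'.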

This change led to 'OutOfMemoryError: Metaspace' in certain cases after upgrading to the 1.10 version. In some cases, a class loading leak has been detected [4] and has to be investigated on its own. In other cases, just increasing the option value helped because the default value was not enough, presumably due to the job specifics. In general, the required Metaspace size depends on the job, and there is no default value that covers all cases. There is an issue to improve the docs for this concern [5].

This survey aims to come up with the most reasonable default value for this option. If you have encountered this issue and increasing the Metaspace size helped (i.e. there is no class loading leak), please report any specifics of your job that you think are relevant for this concern, and the option value that resolved it. There is also a dedicated Jira issue [6] for reporting.

Thanks,
Andrey

Re: [Survey] Default size for the new JVM Metaspace limit in 1.10

Andrey Zagrebin-5
Hi all,

Bumping this topic. This is a poll about:
- increasing the default JVM Metaspace size from 96Mb to 256Mb, and
- existing Flink 1.10 setups with a small process memory size (~1GB).

The community is discussing the 1.10.1 bugfix release and whether to increase the default JVM Metaspace size.
So far, increasing this setting from 96Mb to 256Mb has helped in all reported cases where the default value of 96Mb was not enough.

Increasing the default value can affect existing Flink 1.10 setups, especially those where the process memory size is explicitly set to a relatively small value, e.g. around 1GB,
but the JVM Metaspace is not. This can decrease the size of the total Flink memory and all of its components, e.g. the JVM heap and managed memory.
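
As a rough illustration (simplified numbers, ignoring the exact split and rounding of the other components): with 'taskmanager.memory.process.size: 1024m' kept as is, raising the default Metaspace limit from 96Mb to 256Mb would take the extra 160Mb out of the total Flink memory, so the JVM heap, managed memory, network buffers, etc. would have about 160Mb less to share after an upgrade.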

The question is how many important setups like this (with a small process memory size) already exist, so we can assess how badly they would be affected by the suggested change.
Any feedback is appreciated.

Best,
Andrey
