metaspace out-of-memory & error while retrieving the leader gateway

metaspace out-of-memory & error while retrieving the leader gateway

Claude Murad
Hello, 

I upgraded from Flink 1.7.2 to 1.10.2.  One of the jobs running on the task managers is periodically crashing w/ the following error:

java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.

I found this issue regarding it: 

I have tried increasing taskmanager.memory.jvm-metaspace.size to 256M and 512M and was still having the problem.

I then added the following to flink-conf.yaml to try to get more information about the error:
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
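
(For reference, the two settings together in flink-conf.yaml would look something like the following, using the 512M value; the values are just the ones tried above.)

taskmanager.memory.jvm-metaspace.size: 512m
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log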

When I deployed the change (the cluster runs in Kubernetes), the jobmanager pod fails to start up and the following message shows repeatedly: 

2020-09-18 17:03:46,255 WARN  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.

The only way I can resolve this is to delete the folder from ZooKeeper, which I shouldn't have to do.  

Any ideas on these issues? 




Re: metaspace out-of-memory & error while retrieving the leader gateway

Xintong Song
Hi Claude,

IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Then could you try to find and post the complete command for launching the JVM process? You can try logging into the pod and executing `ps -ef | grep <PID>`.
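
For example, from outside the pod, something like this should also show the full command (the pod name here is just a placeholder):

kubectl exec <jobmanager-pod> -- ps -ef | grep java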

A few more questions: 
- What do you mean by "resolve this"? Does the jobmanager pod get stuck there, and recover when you remove the folder from ZK? Do you have to do the removal every time you submit the job to Kubernetes?
The only way I can resolve this is to delete the folder from zookeeper which I shouldn't have to do.
- Which Flink Kubernetes deployment are you using? The standalone or the native Kubernetes one?
- Which cluster mode are you using? Job cluster, session cluster, or the application mode?

Thank you~

Xintong Song







Re: metaspace out-of-memory & error while retrieving the leader gateway

Xintong Song
## Metaspace OOM
As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.

The problem does not occur in previous versions because Flink only started setting a metaspace limit in the 1.10 release. The class loading leak might have already been there, but was never discovered, which could lead to unpredictable stability and performance issues. That's why Flink updated its memory model and explicitly set the metaspace limit in the 1.10 release.

## Leader retrieving
The command looks good to me. If this problem happened only once, it may be unrelated to adding the options. If it does not block you from getting the heap dump, we can look into it later.

Thank you~

Xintong Song



On Mon, Sep 21, 2020 at 9:37 PM Claude M <[hidden email]> wrote:
Hi Xintong,

Thanks for your reply.  Here is the command output w/ the java.opts:

/usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint --configDir /opt/flink/conf --executionMode cluster

To answer your questions:
  • Correct, in order for the pod to start up, I have to remove the Flink app folder from ZooKeeper.  I only have to delete it once after applying the java.opts arguments.  It doesn't make sense, though, that I should have to do this just from adding a parameter.  
  • I'm using the standalone deployment.
  • I'm using job cluster mode.
A higher-priority issue I'm trying to solve is the metaspace out-of-memory error that is occurring in the task managers.  This was not happening before I upgraded to Flink 1.10.2.  Even after increasing the memory, I'm still encountering the problem.  That is when I added the java.opts arguments to see if I could get more information about the problem, and when I ran into the second issue with the job manager pod not starting up.  


Thanks
  





RE: metaspace out-of-memory & error while retrieving the leader gateway

B.Zhou

Hi Xintong and Claude,

 

In our internal tests, we also encountered these two issues and spent much time debugging them. There are two points I need to confirm to see whether we share the same problem:

  1. Your job is using the default restart strategy, which restarts every second.
  2. Your CPU resource on the jobmanager might be small.

 

Here are some findings I want to share.

## Metaspace OOM

Due to https://issues.apache.org/jira/browse/FLINK-15467 , when there are job restarts, some threads from the source function keep hanging, so the class loader cannot be closed. Each restart then loads new classes, which keeps expanding the metaspace until the OOM happens.
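
As an illustration only (a made-up minimal sketch, not the actual code from the ticket; the class and names are invented), the hanging-thread pattern looks roughly like this:

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Hypothetical example of the FLINK-15467 pattern: a helper thread started by
// user code outlives the job and keeps its ChildFirstClassLoader reachable.
public class LeakySource implements SourceFunction<Long> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        // Helper thread created by user code; its Runnable is a class that was
        // loaded by the job's user-code class loader.
        Thread poller = new Thread(() -> {
            while (true) {                  // never checks 'running'
                try {
                    Thread.sleep(1000L);    // pretend to poll an external system
                } catch (InterruptedException ignored) {
                }
            }
        }, "external-poller");
        poller.start();

        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(System.currentTimeMillis());
            }
            Thread.sleep(1000L);
        }
        // BUG: 'poller' is never interrupted or joined, neither here nor in
        // cancel(). After a restart it keeps running, the old class loader can
        // never be unloaded, and every restart adds more classes to metaspace.
    }

    @Override
    public void cancel() {
        running = false;
    }
}

The fix for this kind of leak is to interrupt and join such helper threads in cancel()/close(), so nothing from the finished job still references classes loaded by its class loader.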

 

## Leader retrieving

Constant restarts may be heavy for the jobmanager; if the JM CPU resources are not enough, the thread for leader retrieval may get stuck.

 

Best Regards,

Brian

 


Re: metaspace out-of-memory & error while retrieving the leader gateway

Xintong Song
Thanks for the input, Brian.

This looks like what we are looking for. The issue is fixed in 1.10.3, which is also consistent with the problem occurring in 1.10.2.

Maybe Claude can further confirm it.

Thank you~

Xintong Song




Re: metaspace out-of-memory & error while retrieving the leader gateway

Claude Murad
Thanks for your responses.  
1.  There were no job restarts prior to the metaspace OOM.  
2.  I tried increasing the CPU request and still encountered the problem.  Any configuration change I make to the job manager, whether it's in the flink-conf.yaml or in the pod's CPU/memory request, results in this problem.  



Re: metaspace out-of-memory & error while retrieving the leader gateway

Claude Murad
Regarding the metaspace memory issue, I was able to get a heap dump and the following is the output:

Problem Suspect 1
One instance of "java.lang.ref.Finalizer" loaded by "<system class loader>" occupies 4,112,624 (11.67%) bytes. The instance is referenced by sun.misc.Cleaner @ 0xb5d6b520 , loaded by "<system class loader>". The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".

Problem Suspect 2
33 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by "sun.misc.Launcher$AppClassLoader @ 0xb4068680" occupy 6,615,416 (18.76%) bytes. 

Based on this, I'm not clear on what needs to be done to solve this. 



Re: metaspace out-of-memory & error while retrieving the leader gateway

Claude Murad
It was mentioned that this issue may be fixed in 1.10.3, but there is no 1.10.3 Docker image here: https://hub.docker.com/_/flink



Re: metaspace out-of-memory & error while retrieving the leader gateway

Xintong Song
How many slots do you have on each task manager?

Flink uses ChildFirstClassLoader for loading user code, to avoid dependency conflicts between user code and Flink's framework. Ideally, after a slot is freed and reassigned to a new job, the user class loaders of the previous job should be unloaded. 33 instances of them do not sound right. It might be worth looking into where the references that keep these instances alive come from.
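
A typical (purely hypothetical) example of such a reference is user code that registers something with a JVM-wide facility and never deregisters it, for instance a shutdown hook; the class and names below are invented:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Hypothetical sketch of a reference that pins a ChildFirstClassLoader.
public class PinningMapper extends RichMapFunction<String, String> {

    private Thread hook;

    @Override
    public void open(Configuration parameters) {
        // The JVM keeps a static reference to every registered shutdown hook.
        // The hook's lambda class was defined by the job's user-code class
        // loader, so that loader is now reachable from outside the job.
        hook = new Thread(() -> System.out.println("flush buffers on JVM exit"));
        Runtime.getRuntime().addShutdownHook(hook);
    }

    @Override
    public String map(String value) {
        return value;
    }

    @Override
    public void close() {
        // BUG: Runtime.getRuntime().removeShutdownHook(hook) is missing, so even
        // after the slot is reassigned to another job the old class loader (and
        // everything it loaded into metaspace) cannot be garbage collected.
    }
}

In the heap dump, the shortest paths from GC roots to the ChildFirstClassLoader instances should point at whatever plays this role in your jobs.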

Flink 1.10.3 is not released yet. If you want to try the unreleased version, you would need to download the sources [1], build the Flink distribution [2] and build your own custom image (based on the 1.10.2 image, replacing the Flink distribution with the one you built).


Re: metaspace out-of-memory & error while retrieving the leader gateway

Claude Murad
I have 35 task managers, with 1 slot on each.  I'm running a total of 7 jobs in the cluster, and all the slots are occupied.  When you say that 33 instances of the ChildFirstClassLoader do not sound right, what should I be expecting?  Could the number of jobs running in the cluster contribute to the out-of-memory error?  I used to have 26 task managers in this cluster with 5 jobs.  
I added 9 additional task managers and 2 jobs, and I noticed this problem started occurring after I made these additions.  If this is the cause of the problem, how can it be resolved? 


On Thu, Sep 24, 2020 at 1:06 AM Xintong Song <[hidden email]> wrote:
How many slots do you have on each task manager?

Flink uses ChildFirstClassLoader for loading user codes, to avoid dependency conflicts between user codes and Flink's framework. Ideally, after a slot is freed and reassigned to a new job, the user class loaders of the previous job should be unloaded. 33 instances of them does not sound right. It might be worth looking into where the references that keep these instances alive come from.

Flink 1.10.3 is not released yet. If you want to try the unreleased version, you would need to download the sources [1], build the flink distribution [2] and build your custom image (from the 1.0.2 image and replace the flink distribution with the one you built).

On Wed, Sep 23, 2020 at 8:29 PM Claude M <[hidden email]> wrote:
It was mentioned that this issue may be fixed in 1.10.3 but there is no 1.10.3 docker image here: https://hub.docker.com/_/flink


On Wed, Sep 23, 2020 at 7:14 AM Claude M <[hidden email]> wrote:
In regards to the metaspace memory issue, I was able to get a heap dump and the following is the output:

Problem Suspect 1
One instance of "java.lang.ref.Finalizer" loaded by "<system class loader>" occupies 4,112,624 (11.67%) bytes. The instance is referenced by sun.misc.Cleaner @ 0xb5d6b520 , loaded by "<system class loader>". The memory is accumulated in one instance of "java.lang.Object[]" loaded by "<system class loader>".

Problem Suspect 2
33 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by "sun.misc.Launcher$AppClassLoader @ 0xb4068680" occupy 6,615,416 (18.76%)bytes. 

Based on this, I'm not clear on what needs to be done to solve this. 


On Tue, Sep 22, 2020 at 3:10 PM Claude M <[hidden email]> wrote:
Thanks for your responses.  
1.  There were no job re-starts prior to the metaspace OEM.  
2.  I tried increasing the CPU request and still encountered the problem.  Any configuration change I make to the job manager, whether it's in the flink-conf.yaml or increasing the pod's CPU/memory request, results with this problem.  


On Tue, Sep 22, 2020 at 12:04 AM Xintong Song <[hidden email]> wrote:
Thanks for the input, Brain.

This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches this problem occurred in 1.10.2.

Maybe Claude can further confirm it.

Thank you~

Xintong Song



On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian <[hidden email]> wrote:

Hi Xintong and Claude,

 

In our internal tests, we also encountered these two issues and spent much time debugging them. There are two points I need to confirm to see if we share the same problem:

  1. Your job is using the default restart strategy, which restarts the job every second.
  2. Your CPU resources for the jobmanager might be small.

 

Here are some findings I want to share.

## Metaspace OOM

Due to https://issues.apache.org/jira/browse/FLINK-15467 , when a job restarts, some threads from the SourceFunction can hang, so the class loader cannot be closed. Each new restart then loads new classes, which expands the metaspace until the OOM finally happens.
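As an illustration only (a minimal sketch, not the actual code from FLINK-15467; the PollingSource class and pollExternalSystem() call are hypothetical), a user SourceFunction whose run() loop never observes cancellation leaves its thread alive after a restart, and a live thread pins its ChildFirstClassLoader so it can never be unloaded:

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class PollingSource implements SourceFunction<String> {

    // If run() ignores this flag (or blocks forever in an external call), the source
    // thread never exits after cancellation/restart; the thread keeps a reference to
    // its context class loader, so the ChildFirstClassLoader of the old job leaks.
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            String record = pollExternalSystem(); // hypothetical blocking call to an external system
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
            }
        }
    }

    @Override
    public void cancel() {
        running = false; // also close/interrupt the blocking client here so run() can return
    }

    // Placeholder for the external client; in the leaking case this call blocks indefinitely.
    private String pollExternalSystem() {
        return "record";
    }
}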

 

## Leader retrieving

Constant restarts can be heavy on the jobmanager; if the JM's CPU resources are not enough, the thread retrieving the leader may get stuck.

 

Best Regards,

Brian

 

From: Xintong Song <[hidden email]>
Sent: Tuesday, September 22, 2020 10:16
To: Claude M; user
Subject: Re: metaspace out-of-memory & error while retrieving the leader gateway

 

## Metaspace OOM

As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track, trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.

 

The problem does not occur in previous versions because Flink only started setting a metaspace limit in the 1.10 release. The class loading leak might have already been there, but was never discovered; it could lead to unpredictable stability and performance issues. That's why Flink updated its memory model and explicitly set the metaspace limit in the 1.10 release.
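For reference, a minimal flink-conf.yaml sketch (the values are only illustrative; 256M and 512M were already tried earlier in this thread) that raises the task manager metaspace limit and keeps the heap-dump options:

taskmanager.memory.jvm-metaspace.size: 512m
env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log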

 

## Leader retrieving

The command looks good to me. If this problem happens only once, it could be unrelated to adding the options. If it does not block you from getting the heap dump, we can look into it later.


Thank you~

Xintong Song

 

 

On Mon, Sep 21, 2020 at 9:37 PM Claude M <[hidden email]> wrote:

Hi Xintong,

 

Thanks for your reply.  Here is the command output w/ the java.opts:

 

/usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint --configDir /opt/flink/conf --executionMode cluster

 

To answer your questions:

  • Correct, in order for the pod to start up, I have to remove the Flink app folder from ZooKeeper (a sketch of the command follows this list).  I only have to delete it once after applying the java.opts arguments.  It doesn't make sense, though, that I should have to do this just from adding a parameter.
  • I'm using the standalone deployment.
  • I'm using job cluster mode.
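A minimal sketch of that removal, assuming the default high-availability.zookeeper.path.root of /flink and a placeholder ZooKeeper address and cluster id (on ZooKeeper 3.4 the command is rmr rather than deleteall):

./bin/zkCli.sh -server zookeeper:2181
deleteall /flink/<cluster-id>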

A higher-priority issue I'm trying to solve is the metaspace out-of-memory that is occurring in the task managers.  This was not happening before I upgraded to Flink 1.10.2, and even after increasing the memory I'm still encountering the problem.  That is why I added the java.opts argument, to see if I can get more information about the problem, and that is when I ran across the second issue w/ the job manager pod not starting up.

 

 

Thanks

  

 



Re: metaspace out-of-memory & error while retrieving the leader gateway

Xintong Song
I'm not entirely sure how many instances of ChildFirstClassLoader should be expected. I would say 3~5 sounds fine (1 per slot, 1 for the file system plugin, 1 for the metrics reporter plugin, and probably a few more that I'm not aware of). How many task managers and jobs exist in the cluster should not affect this number.

Ideally, when a slot is assigned to a job for execution, a class loader is created. When the slot is freed (tasks finished, canceled, or failed), the class loader should be released. Later, when the slot is assigned again (to either the same job or a different one), a new class loader should be created.

I suspect something goes wrong so that the class loaders from previous assignments are not fully released, and thus more and more class loaders accumulate. Therefore, I suggest figuring out which objects hold references to the class loaders and prevent them from being released, and seeing if we can do something about it.
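For example (a sketch only, assuming you load the same heap dump into Eclipse Memory Analyzer), listing the loaders with the OQL query below and then running "Merge Shortest Paths to GC Roots" on the results should show which objects keep them alive:

SELECT * FROM org.apache.flink.util.ChildFirstClassLoader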

Thank you~

Xintong Song


