JobManager seems to be leaking temporary jar files


Maciek Próchniak
Hello,

in our setup we have:

- Flink 1.11.2

- job submission via the REST API (first we upload the jar, then we
submit multiple jobs with it)

- additional jars embedded in the lib directory of the main jar (this is
the crucial part)

When we submit jobs this way, Flink creates new temp jar files via the
PackagedProgram.extractContainedLibraries method.

We observe that they are not removed after the job finishes; it seems
that PackagedProgram.deleteExtractedLibraries is not invoked when using
the REST API.

What's more, it seems that those jars remain open in the JobManager
process. When we delete them manually via scripts, the disk space is not
reclaimed until the process is restarted. We also see via heap dump
inspection that java.util.zip.ZipFile$Source objects remain, pointing to
those files. This is quite a problem for us, as we submit quite a few
jobs, and after a while we run out of either heap or disk space on the
JobManager process/host. Unfortunately, so far I cannot find where this
leak happens...
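
The symptom described above (space not reclaimed after a manual delete) is what happens on Linux when a process still holds an open descriptor to an unlinked file. A minimal diagnostic sketch, assuming a Linux host with /proc mounted; DeletedOpenFiles is a hypothetical helper and not part of Flink:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Flink. Linux only (relies on /proc). */
class DeletedOpenFiles {

    /** Returns the link targets of this process's descriptors that point at deleted files. */
    static List<String> list() {
        List<String> result = new ArrayList<>();
        Path fdDir = Paths.get("/proc/self/fd");
        if (!Files.isDirectory(fdDir)) {
            return result; // not Linux, or /proc is not mounted
        }
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                try {
                    String target = Files.readSymbolicLink(fd).toString();
                    // The kernel appends " (deleted)" to the link target of unlinked files.
                    if (target.endsWith(" (deleted)")) {
                        result.add(target);
                    }
                } catch (IOException e) {
                    // The descriptor was closed while we were iterating; skip it.
                }
            }
        } catch (IOException e) {
            // Could not read /proc/self/fd; return what was collected so far.
        }
        return result;
    }
}
```

The helper only sees the JVM it runs in. For an already running JobManager, the same information can be read from outside the process with ls -l /proc/<pid>/fd.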

Does anybody have some pointers on where we can search? Or how to fix
this behaviour?


thanks,

maciek


Re: JobManager seems to be leaking temporary jar files

Matthias
Hi Maciek,
my understanding is that the jars in the JobManager should be cleaned up after the job terminates (I assume that your jobs finished successfully). The jars are managed by the BlobService. The dispatcher will trigger the jobCleanup in [1] after job termination. Are there any suspicious log messages that might indicate an issue?
I'm adding Chesnay to this thread as he might have more insights here.


On Mon, Jan 25, 2021 at 8:37 PM Maciek Próchniak <[hidden email]> wrote:
[...]

Re: JobManager seems to be leaking temporary jar files

Maciek Próchniak

Hi Matthias,

I think the problem lies somewhere in JarRunHandler, as this is the place where the files are created.

I think these are not the files that are managed via BlobService, as they are not stored in BlobService folders (I ran an experiment, changing the default BlobServer folders).

It seems to me that CliFrontend deletes those files explicitly:

https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L250

whereas I couldn't find such an invocation in JarRunHandler (not deleting those files does not fully explain the leak on the heap, though...)


thanks,

maciek

On 26.01.2021 11:16, Matthias Pohl wrote:
[...]

Re: JobManager seems to be leaking temporary jar files

Chesnay Schepler
The problem of submitted jar files not being closed is a known one: https://issues.apache.org/jira/browse/FLINK-9844
IIRC it's not exactly trivial to fix since class-loading is involved.
It's not strictly related to the REST API; it also occurs in the CLI but is less noticeable since jars are usually not deleted.

As for the issue with deleteExtractedLibraries, Maciek is generally on the right track.
The explicit delete call is indeed missing. The best place to put it is probably JarRunHandler#handleRequest, within handle after the job was run.
A similar issue also exists in the JarPlanHandler.
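
The idea of issuing the delete once the job run has completed can be sketched with a plain CompletableFuture. This is only an illustration under stated assumptions: RunWithCleanup, deleteExtracted and run are hypothetical names, not Flink's handler code, and whenComplete is used here so that the cleanup runs on both success and failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical stand-in for the handler logic; these are not Flink's classes. */
class RunWithCleanup {

    /** Deletes the given extracted libraries, ignoring files that are already gone. */
    static void deleteExtracted(List<Path> extractedLibs) {
        for (Path lib : extractedLibs) {
            try {
                Files.deleteIfExists(lib);
            } catch (IOException e) {
                System.err.println("Could not delete " + lib + ": " + e);
            }
        }
    }

    /** Runs the submission asynchronously and cleans up whether it succeeded or failed. */
    static <T> CompletableFuture<T> run(Supplier<T> submission, List<Path> extractedLibs) {
        return CompletableFuture
                .supplyAsync(submission)
                // whenComplete keeps the original result or exception and runs the cleanup first.
                .whenComplete((result, failure) -> deleteExtracted(extractedLibs));
    }
}
```

Because the cleanup is attached to the future rather than placed after a blocking call, the temp files are removed even when the submission fails.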


On 1/26/2021 12:21 PM, Maciek Próchniak wrote:
[...]


Re: JobManager seems to be leaking temporary jar files

Maciek Próchniak

Hi Chesnay,

thanks for the reply. I wonder if FLINK-21164 will help without FLINK-9844 - if the jar file is not closed, it won't be successfully deleted?

As for FLINK-9844 - I understand that having code like

if (userClassLoader instanceof Closeable) { ((Closeable) userClassLoader).close(); }

is too much of a "dirty trick" to be considered?
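
For reference, a self-contained sketch of that instanceof check in action. It builds a throwaway jar, forces a URLClassLoader to open it, closes the loader and then deletes the file. CloseUserClassLoader, closeIfPossible and writeDemoJar are hypothetical names used only for this illustration. Note that on Linux the delete would succeed even with the jar still open (the space just would not be reclaimed), so the close matters for releasing the handle rather than for the delete call itself.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

/** Hypothetical demo class; not part of Flink. */
class CloseUserClassLoader {

    /** Closes the class loader if it supports closing; URLClassLoader does since Java 7. */
    static boolean closeIfPossible(ClassLoader userClassLoader) throws IOException {
        if (userClassLoader instanceof Closeable) {
            ((Closeable) userClassLoader).close();
            return true;
        }
        return false;
    }

    /** Writes a tiny jar with a single text entry, standing in for an extracted library. */
    static Path writeDemoJar() throws IOException {
        Path jar = Files.createTempFile("demo-lib", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jarOut = new JarOutputStream(out)) {
            jarOut.putNextEntry(new JarEntry("hello.txt"));
            jarOut.write("hello".getBytes(StandardCharsets.UTF_8));
            jarOut.closeEntry();
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        Path jar = writeDemoJar();
        ClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, null);
        // Reading a resource forces the loader to open the jar file.
        try (InputStream in = loader.getResourceAsStream("hello.txt")) {
            System.out.println("first byte: " + (char) in.read());
        }
        System.out.println("closed: " + closeIfPossible(loader));
        Files.delete(jar);
        System.out.println("jar still exists: " + Files.exists(jar));
    }
}
```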


thanks,

maciek

 

On 27.01.2021 13:00, Chesnay Schepler wrote:
[...]


Re: JobManager seems to be leaking temporary jar files

Chesnay Schepler
Code-wise it appears that things have gotten simpler and we can use a URLClassLoader within PackagedProgram.

We probably won't get around a dedicated close() method on the PackagedProgram.

I think in FLINK-21164 I have identified the right places to issue this call within the jar handlers.

On the CLI side, I suppose we can just replace all usages of deleteExtractedLibraries with close().
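
A dedicated close() along these lines could look roughly like the sketch below. It is an assumption-laden illustration, not Flink's implementation: ClosableProgram is a hypothetical stand-in for PackagedProgram, and the field and method names are made up for the example. The point it shows is the ordering: close the URLClassLoader first so no jar is still open, then delete the extracted temp files.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for PackagedProgram; the real class has a different shape. */
class ClosableProgram implements AutoCloseable {

    private final List<Path> extractedTempLibraries;
    private final URLClassLoader userCodeClassLoader;

    ClosableProgram(List<Path> extractedTempLibraries) throws MalformedURLException {
        this.extractedTempLibraries = new ArrayList<>(extractedTempLibraries);
        URL[] urls = new URL[extractedTempLibraries.size()];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = extractedTempLibraries.get(i).toUri().toURL();
        }
        this.userCodeClassLoader = new URLClassLoader(urls, getClass().getClassLoader());
    }

    ClassLoader getUserCodeClassLoader() {
        return userCodeClassLoader;
    }

    /** Closes the class loader first so no jar is still open, then deletes the temp files. */
    @Override
    public void close() throws IOException {
        try {
            userCodeClassLoader.close();
        } finally {
            for (Path lib : extractedTempLibraries) {
                Files.deleteIfExists(lib);
            }
            extractedTempLibraries.clear();
        }
    }
}
```

Implementing AutoCloseable lets callers wrap the program in a try-with-resources statement, so the cleanup also runs when the submission throws.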


On 1/28/2021 7:47 AM, Maciek Próchniak wrote:
[...]