Flink 1.5.2 process keeps reference to deleted blob files.

Piotr Szczepanek
Hello,

we are using YarnClusterClient for job submission. After a job finishes,
whether it succeeded or failed, the blob file for that job appears to be
deleted, but the Flink process still holds a handle to it. As a result, the
disk space is not released, and we ran into a "No space left on device" error.
Restarting the Flink cluster brings things back to normal, but we submit a
large number of jobs, and frequent cluster restarts are not a solution.

Results of lsof are:
During job execution:
lsof /flinkDir | grep job_dbafb671b0d60ed8a8ec2651fe59303b
java    11883  yarn  mem    REG  253,2 112384928 109973177
/flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578
java    11883  yarn 1837r   REG  253,2 112384928 109973177
/flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578

After job execution:
lsof /flinkDir | grep job_dbafb671b0d60ed8a8ec2651fe59303b
java    11883  yarn  DEL    REG  253,2           109973177
/flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578
java    11883  yarn 1837r   REG  253,2 112384928 109973177
/flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578
*(deleted)*

So the blob file is marked as deleted, but it still occupies disk space
because the Flink container process still holds a handle to it.
Can you please advise how we can avoid this situation, or whether it is
caused by a bug in Flink?
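
The lsof check above can also be done from Java. The following is only a minimal sketch, assuming a Linux host where /proc/<pid>/fd contains one symbolic link per open descriptor and the kernel appends " (deleted)" to the target of an unlinked file. The class name DeletedOpenFiles and the method findDeleted are made up for illustration and are not part of Flink. Memory-mapped files (the "mem"/"DEL" rows in lsof) are listed in /proc/<pid>/maps instead and are not covered here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink.
public class DeletedOpenFiles {

    /**
     * Returns the targets of all symbolic links in fdDir whose target ends
     * with " (deleted)", the suffix Linux uses for unlinked-but-open files.
     */
    public static List<String> findDeleted(Path fdDir) throws IOException {
        List<String> result = new ArrayList<>();
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                if (!Files.isSymbolicLink(fd)) {
                    continue;
                }
                try {
                    String target = Files.readSymbolicLink(fd).toString();
                    if (target.endsWith(" (deleted)")) {
                        result.add(target);
                    }
                } catch (IOException e) {
                    // The descriptor was closed between listing and reading; skip it.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current JVM; pass e.g. /proc/11883/fd to inspect another process.
        Path fdDir = Paths.get(args.length > 0 ? args[0] : "/proc/self/fd");
        for (String target : findDeleted(fdDir)) {
            System.out.println(target);
        }
    }
}
```

Reading another process's descriptor directory normally requires running as the same user (here, yarn) or as root.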



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
Re: Flink 1.5.2 process keeps reference to deleted blob files.

Stefan Richter
Hi,

I think it would be very helpful if you could identify what data is behind that file. For example, I could imagine that it is a jar file that was used by the TM, and some of its classes are still in use or were loaded by a classloader that has not yet been GCed. Depending on that, the problem could be in the user code, in Flink's classloading, or in the blob storage. I would suggest opening a Jira issue and supplying as much information about the dangling file as possible (e.g. by concluding from the log which blob key was mapped to which file, from the size, or by peeking at the content).
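
To illustrate the classloader hypothesis: a java.net.URLClassLoader keeps the jar files it has opened until close() is called on it, so a loader that is never closed (or collected) can keep a deleted jar alive on disk. The sketch below is a generic JDK example and not Flink's own classloading code. The class ClassLoaderJarHandle, the jar name user-code.jar and the resource hello.txt are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

// Hypothetical demo class, not part of Flink.
public class ClassLoaderJarHandle {

    /** Writes a tiny jar with one text resource (a stand-in for a user jar). */
    public static Path createJar(Path dir) throws IOException {
        Path jar = dir.resolve("user-code.jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            jos.putNextEntry(new JarEntry("hello.txt"));
            jos.write("hello".getBytes(StandardCharsets.UTF_8));
            jos.closeEntry();
        }
        return jar;
    }

    /** Reads the resource through a class loader that is closed afterwards. */
    public static String readAndRelease(Path jar) throws IOException {
        URL[] urls = {jar.toUri().toURL()};
        // try-with-resources calls loader.close(), which releases the jar handle.
        try (URLClassLoader loader = new URLClassLoader(urls, null);
             InputStream in = loader.getResourceAsStream("hello.txt")) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[256];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("classloader-demo");
        Path jar = createJar(dir);
        System.out.println(readAndRelease(jar));
        // Because the loader was closed, deleting the jar now frees its disk space.
        Files.delete(jar);
        Files.delete(dir);
    }
}
```

If the loader were left open instead, deleting the jar on Linux would only unlink the name, and lsof would show the file as "(deleted)" for as long as the process held the handle.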

Best,
Stefan

(In reply to the message from Piotr Szczepanek <[hidden email]> of 19.09.2018, 16:04.)