Hello,
we are using YarnClusterClient for job submission. After a job finishes (whether it succeeds or fails), the blob file for that job appears to be deleted, but the Flink process still holds a handle to it. As a result, the file's disk space is not released, and we ran into a "No space left on device" error. Restarting the Flink cluster brings things back to normal, but we submit a large number of jobs, and frequent cluster restarts are not a solution.

Results of lsof are:

During job execution:

lsof /flinkDir | grep job_dbafb671b0d60ed8a8ec2651fe59303b
java 11883 yarn mem REG 253,2 112384928 109973177 /flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578
java 11883 yarn 1837r REG 253,2 112384928 109973177 /flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578

After job execution:

lsof /flinkDir | grep job_dbafb671b0d60ed8a8ec2651fe59303b
java 11883 yarn DEL REG 253,2 109973177 /flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578
java 11883 yarn 1837r REG 253,2 112384928 109973177 /flinkDir/yarn/../application_1536668870638_5555/blobStore-a1bcdbd4-5388-4c56-8052-6051f5af38dd/job_dbafb671b0d60ed8a8ec2651fe59303b/blob_p-8771d9ccac35e28d8571ac8957feaaecdebaeadd-7748aec7fe7369ca26181d0f94b1a578 (deleted)

So the blob file is marked as deleted, but it still occupies disk space because the Flink container process still holds a handle to it.
Can you please advise how we can avoid this situation, or whether it is caused by a bug in Flink?

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
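The check done by hand with lsof above can also be scripted. The following is only a minimal sketch, assuming a Linux host where /proc is mounted and where the script runs as the same user as the Flink process (or as root). The function name deleted_open_files is hypothetical and not part of Flink or lsof.

```python
import os


def deleted_open_files(pid):
    """Return sorted (fd, link_target) pairs for descriptors of `pid`
    that point at a file which has already been unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # Linux appends " (deleted)" to the link target of an unlinked file.
        if target.endswith(" (deleted)"):
            found.append((int(name), target))
    return sorted(found)
```

With the PID from the lsof output, deleted_open_files(11883) would be expected to include descriptor 1837 for as long as the handle stays open. Note that this only covers regular descriptors; memory-mapped files (the mem and DEL rows in lsof) are not listed under /proc/<pid>/fd.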
Hi,
I think it would be very helpful if you could identify what data is behind that file. For example, I could imagine that it is a jar file that was used by the TM, and that some of its classes are still in use or loaded by a classloader that has not yet been GCed. Depending on that, there could be a problem in the user code, in Flink's classloading, or with the blob storage. I would suggest opening a Jira issue and supplying as much information about the dangling file as possible (e.g. by concluding from the log which blob key was mapped to which file, from the size, or by peeking at the content).

Best,
Stefan

> On 19.09.2018 at 16:04, Piotr Szczepanek <[hidden email]> wrote:
>
> [...]
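One possible way to peek at the content of the dangling file, as suggested above, is to read it through the descriptor that is still open. This is a sketch under assumptions: it relies on Linux's /proc/<pid>/fd entries, needs the same user as the Flink process (or root), and the helper names peek_open_file and looks_like_jar are made up for illustration. It checks for the zip signature that a non-empty jar starts with.

```python
import os

# Local file header signature found at the start of a non-empty zip/jar file.
ZIP_MAGIC = b"PK\x03\x04"


def peek_open_file(pid, fd, length=4):
    """Return (size_in_bytes, first_bytes) of the file that `pid` holds open as `fd`.

    Going through /proc/<pid>/fd/<fd> still works after the path was unlinked,
    because the inode stays alive while a descriptor refers to it.
    """
    proc_path = f"/proc/{pid}/fd/{fd}"
    size = os.stat(proc_path).st_size
    with open(proc_path, "rb") as handle:
        return size, handle.read(length)


def looks_like_jar(pid, fd):
    """True if the open file starts with the zip signature."""
    _, head = peek_open_file(pid, fd, len(ZIP_MAGIC))
    return head == ZIP_MAGIC
```

For the process and descriptor shown in the lsof output, the call would be looks_like_jar(11883, 1837). A True result would only suggest a jar; it does not by itself say whether user code, classloading, or the blob storage keeps the handle open.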