why not flink delete the checkpoint directory recursively?

Joshua Fan
Hi,

When a checkpoint is to be deleted, FsCompletedCheckpointStorageLocation.disposeStorageLocation is called.
Inside it, fs.delete(exclusiveCheckpointDir, false) performs the deletion. Why is the recursive parameter set to false, given that exclusiveCheckpointDir really is a directory? With our Hadoop setup, this prevents the checkpoint from being removed.
It would be easy to change the recursive parameter to true, but is there any potential harm in doing so?

Yours sincerely
Josh

Re: why not flink delete the checkpoint directory recursively?

rmetzger0
Hey Josh,

As far as I understand the code in CompletedCheckpoint.discard(), Flink first removes all the files in StateUtil.bestEffortDiscardAllStateObjects and then deletes the directory.
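
A minimal sketch of that order of operations, written against java.nio.file rather than Flink's own FileSystem abstraction so that it runs on its own. The class name, method names, and file names are hypothetical and only illustrate the idea: delete the files you know about, then remove the directory non-recursively, so that anything unexpected left behind makes the final step fail instead of being silently wiped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration; this is not Flink code.
final class CheckpointDirCleanupSketch {

    /**
     * Deletes the files the caller knows about, then removes the directory
     * non-recursively. Returns false if the directory still has content.
     */
    static boolean discard(Path checkpointDir, List<Path> knownFiles) {
        try {
            for (Path file : knownFiles) {
                Files.deleteIfExists(file);
            }
            // Files.delete on a directory succeeds only if it is empty,
            // similar in spirit to delete(dir, recursive = false).
            Files.delete(checkpointDir);
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false; // something unexpected is still in the directory
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: creates a temporary directory containing the given file names. */
    static Path newDirWithFiles(String... names) {
        try {
            Path dir = Files.createTempDirectory("chk-");
            for (String name : names) {
                Files.createFile(dir.resolve(name));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With every file listed, `discard` removes the directory; if a file is not listed, it returns false and the leftover file stays in place for inspection.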

Which files are left over in your case?
Do you see any exceptions on the TaskManagers?

Best,
Robert

On Wed, Nov 11, 2020 at 12:08 PM Joshua Fan <[hidden email]> wrote:
> Hi,
>
> When a checkpoint is to be deleted, FsCompletedCheckpointStorageLocation.disposeStorageLocation is called.
> Inside it, fs.delete(exclusiveCheckpointDir, false) performs the deletion. Why is the recursive parameter set to false, given that exclusiveCheckpointDir really is a directory? With our Hadoop setup, this prevents the checkpoint from being removed.
> It would be easy to change the recursive parameter to true, but is there any potential harm in doing so?
>
> Yours sincerely
> Josh

Re: why not flink delete the checkpoint directory recursively?

Joshua Fan
Hi Robert,

When the recursive argument of `delete(Path f, boolean recursive)` is false, HDFS throws an exception like the one below:
checkpoint-exception.png

Yours sincerely
Josh

On Thu, Nov 12, 2020 at 4:29 PM Robert Metzger <[hidden email]> wrote:
> Hey Josh,
>
> As far as I understand the code in CompletedCheckpoint.discard(), Flink first removes all the files in StateUtil.bestEffortDiscardAllStateObjects and then deletes the directory.
>
> Which files are left over in your case?
> Do you see any exceptions on the TaskManagers?
>
> Best,
> Robert


Re: why not flink delete the checkpoint directory recursively?

r_khachatryan
Hi,

I think Robert is right: the state handles are deleted first, and then the directory is deleted non-recursively.
If an exception occurs while removing the files, it is combined with the other exception (as a suppressed exception).
So Flink probably failed to delete some files, and the directory removal then failed because of that.
Could you share the full exception so we can check this?
Please also check which files still exist there, as Robert suggested.
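
For readers unfamiliar with suppressed exceptions, here is a small sketch of the general pattern using the standard `Throwable.addSuppressed` method. The class and method names are hypothetical, and which failure Flink treats as the primary one is not something this sketch claims to reproduce; it only shows how several cleanup failures can end up in a single exception, which is why the full stack trace matters.

```java
import java.util.List;

// Hypothetical illustration of collecting cleanup failures; this is not Flink code.
final class SuppressedFailuresSketch {

    /**
     * Runs every cleanup step even if earlier ones fail. The first failure is
     * kept as the primary exception and later ones are attached as suppressed.
     * Returns null if every step succeeded.
     */
    static Exception runAll(List<Runnable> steps) {
        Exception first = null;
        for (Runnable step : steps) {
            try {
                step.run();
            } catch (RuntimeException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        return first;
    }
}
```

Calling `getSuppressed()` on the returned exception lists the later failures, which a full stack trace prints as "Suppressed:" entries.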

Regards,
Roman


On Tue, Nov 17, 2020 at 10:38 AM Joshua Fan <[hidden email]> wrote:
> Hi Robert,
>
> When the recursive argument of `delete(Path f, boolean recursive)` is false, HDFS throws an exception like the one below:
> checkpoint-exception.png
>
> Yours sincerely
> Josh


Re: why not flink delete the checkpoint directory recursively?

Joshua Fan
Hi Roman and Robert,

Thank you.
I have checked the code and the case where checkpoint deletion failed. Yes, Flink deletes the meta file and the operator state file first, and then deletes the checkpoint directory, which by then really is empty. The root cause of the failure is that the Hadoop delete checks both whether the path is a directory and the recursive parameter. I will work with the people in charge of our HDFS to solve this problem.
Thanks again.
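
The root cause described above can be modeled, very roughly, as the difference between two delete policies. This is a hypothetical sketch built on java.nio.file; it is not Hadoop or Flink code, and the "strict" variant is only an assumption about the kind of check involved, not a reproduction of any particular HDFS version.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical model of two delete policies; this is not Hadoop or Flink code.
final class DeletePolicySketch {

    /** Lenient policy: delete(dir, recursive = false) is accepted when dir is empty. */
    static boolean lenientAccepts(Path path, boolean recursive) {
        return recursive || !Files.isDirectory(path) || isEmptyDir(path);
    }

    /** Strict policy (assumed): any directory is rejected unless recursive = true. */
    static boolean strictAccepts(Path path, boolean recursive) {
        return recursive || !Files.isDirectory(path);
    }

    static boolean isEmptyDir(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return !entries.findAny().isPresent();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: creates an empty temporary directory. */
    static Path newEmptyDir() {
        try {
            return Files.createTempDirectory("chk-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Under the lenient policy, a non-recursive delete of an already emptied checkpoint directory is accepted; under the strict one it is rejected even though nothing is left inside, which matches the symptom discussed in this thread.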

Yours sincerely
Josh

On Tue, Nov 17, 2020 at 6:36 PM Khachatryan Roman <[hidden email]> wrote:
> Hi,
>
> I think Robert is right: the state handles are deleted first, and then the directory is deleted non-recursively.
> If an exception occurs while removing the files, it is combined with the other exception (as a suppressed exception).
> So Flink probably failed to delete some files, and the directory removal then failed because of that.
> Could you share the full exception so we can check this?
> Please also check which files still exist there, as Robert suggested.
>
> Regards,
> Roman

