[FLINK-39874][filesystems] Fix temp file leak on commit-time upload f…#28360
Open
NestDream wants to merge 2 commits into
Open
[FLINK-39874][filesystems] Fix temp file leak on commit-time upload f…#28360NestDream wants to merge 2 commits into
NestDream wants to merge 2 commits into
Conversation
…ailure and non-idempotent deletes in NativeS3RecoverableFsDataOutputStream uploadCurrentPart() deleted the local temp file only after uploadPart() returned, with no try/finally. This permanently leaks a temp file on the closeForCommit() path: closeForCommit() sets closed = true before calling uploadCurrentPart(), so when uploadPart() throws (e.g. an S3 5xx surfacing after the SDK's own retries), the later close() short-circuits on its "if (!closed)" guard and never reclaims the temp file. The single pending part's s3-part-<uuid> file is then orphaned in the shared io.tmp.dirs (one leaked file per affected stream) until the TaskManager is recycled. (On the write() path a later close() does still reclaim the file, so there the missing per-part cleanup was a robustness gap rather than a permanent leak.) Wrap the upload in try/finally and delete the temp file in the finally, so it is removed whether or not uploadPart() succeeds. Any delete failure in the finally is caught and logged so it cannot mask the original upload IOException. Also replace three Files.delete() calls with Files.deleteIfExists() so cleanup is idempotent when the temp file is already gone. This closes a real (if minor) TOCTOU in close(): write()/uploadCurrentPart() run without the lock and delete currentTempFile before createNewTempFile() reassigns it, so a concurrent cancellation close() could previously hit NoSuchFileException between its exists() check and delete(). Add regression tests: closeForCommitUploadFailureDeletesTempFile (the permanent commit-path leak) and uploadPartFailureFromWriteDeletesTempFile both assert no s3-part-* file remains after an uploadPart() failure and fail without the fix; closeForCommitIsIdempotentWhenTempFileMissing asserts closeForCommit() succeeds when the temp file was already removed (threw NoSuchFileException before the fix); closeForCommitSuccessDeletesTempFile guards the happy path.
Collaborator
Author
|
@flinkbot run azure |
Author
|
CI is green (Azure build 75800). The change wraps uploadCurrentPart() in try/finally to fix a commit-time temp-file leak and converts the three Files.delete() sites to Files.deleteIfExists() for idempotent cleanup. There are four regression tests; three fail on unpatched master and one is a happy-path control. @gaborgsomogyi when you have a moment, would you be able to take a look, or assign FLINK-39874 to me so this can be reviewed? CC @Samrat002 for native-s3 context. Happy to address any feedback. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is the purpose of the change
NativeS3RecoverableFsDataOutputStreamcan permanently leak a local temp file when a part upload fails at commit time.closeForCommit()setsclosed = trueand then callsuploadCurrentPart()to flush the final pending part. IfuploadPart()throws there (for example an S3 5xx that survives the SDK's internal retries), the upload's only cleanup is skipped, and the stream's laterclose()short-circuits on itsif (!closed)guard and never runs. The single pending part'ss3-part-<uuid>file is then orphaned in the sharedio.tmp.dirs(one leaked file per affected stream) until the TaskManager is recycled. Under repeated commit failures these accumulate across the lifetime of the process.The same per-part cleanup is also missing on the
write()path, where a large part is flushed mid-stream. That one is not a permanent leak — a laterclose()still reclaims the file — so it's a robustness gap rather than a leak, but it's the same root cause and is fixed the same way.While in here, three
Files.delete()calls are replaced withFiles.deleteIfExists(). One of them, inclose(), is a real (if minor) intra-class TOCTOU:write()/uploadCurrentPart()run without the lock and can deletecurrentTempFile, so a concurrent cancellationclose()could fall through itsexists()check and then throwNoSuchFileExceptionon the delete. The other two are hardening.Brief change log
uploadCurrentPart()in atry/finallyso the temp file is always deleted, even whenuploadPart()throws; a failed delete is caught and logged viaLOG.warnso it can't mask the original uploadIOException.Files.delete()calls (inuploadCurrentPart(), thecloseForCommit()else-branch, andclose()) withFiles.deleteIfExists(), and drop the now-redundantexists()guard inclose().Verifying this change
This change added tests and can be verified as follows:
closeForCommitUploadFailureDeletesTempFile— drives the permanent-leak path. A small write (1024 bytes, under the 5 MB min part size) does not flush duringwrite(), so the onlyuploadPart()happens at commit time; that upload is made to throw, and the test asserts nos3-part-*file is left behind. Fails without the fix.uploadPartFailureFromWriteDeletesTempFile— drives thewrite()-path flush (a full 5 MB part) with a failing upload and asserts the temp file is cleaned up. Fails without the fix.closeForCommitIsIdempotentWhenTempFileMissing— exercises thecloseForCommit()else-branch (no pending bytes) with the temp file already gone. ThrowsNoSuchFileExceptionwithout the fix; passes withdeleteIfExists.closeForCommitSuccessDeletesTempFile— happy-path control: a successful commit deletes the temp file. Passes with and without the fix.With the fix, all 4 pass. Without the fix (source reverted to pristine
master, tests kept), the 3 above fail (2 leak assertions + 1NoSuchFileException) and the happy-path control passes. Also manually verified the happy path against real S3 (13 MB → genuine 3-part multipart upload, object committed, no locals3-part-*files left, no orphan multipart uploads).Does this pull request potentially affect one of the following parts:
@Public(Evolving): noDocumentation
Was generative AI tooling used to co-author this PR?