Skip to content

[FLINK-39874][filesystems] Fix temp file leak on commit-time upload f…#28360

Open
NestDream wants to merge 2 commits into
apache:masterfrom
NestDream:fix-s3-native-tempfile-leak
Open

[FLINK-39874][filesystems] Fix temp file leak on commit-time upload f…#28360
NestDream wants to merge 2 commits into
apache:masterfrom
NestDream:fix-s3-native-tempfile-leak

Conversation

@NestDream

Copy link
Copy Markdown

What is the purpose of the change

NativeS3RecoverableFsDataOutputStream can permanently leak a local temp file when a part upload fails at commit time.

closeForCommit() sets closed = true and then calls uploadCurrentPart() to flush the final pending part. If uploadPart() throws there (for example an S3 5xx that survives the SDK's internal retries), the upload's only cleanup is skipped, and the stream's later close() short-circuits on its if (!closed) guard and never runs. The single pending part's s3-part-<uuid> file is then orphaned in the shared io.tmp.dirs (one leaked file per affected stream) until the TaskManager is recycled. Under repeated commit failures these accumulate across the lifetime of the process.

The same per-part cleanup is also missing on the write() path, where a large part is flushed mid-stream. That one is not a permanent leak — a later close() still reclaims the file — so it's a robustness gap rather than a leak, but it's the same root cause and is fixed the same way.

While in here, three Files.delete() calls are replaced with Files.deleteIfExists(). One of them, in close(), is a real (if minor) intra-class TOCTOU: write() / uploadCurrentPart() run without the lock and can delete currentTempFile, so a concurrent cancellation close() could fall through its exists() check and then throw NoSuchFileException on the delete. The other two are hardening.

Brief change log

  • Wrap the upload in uploadCurrentPart() in a try/finally so the temp file is always deleted, even when uploadPart() throws; a failed delete is caught and logged via LOG.warn so it can't mask the original upload IOException.
  • Replace the three Files.delete() calls (in uploadCurrentPart(), the closeForCommit() else-branch, and close()) with Files.deleteIfExists(), and drop the now-redundant exists() guard in close().

Verifying this change

This change added tests and can be verified as follows:

  • closeForCommitUploadFailureDeletesTempFile — drives the permanent-leak path. A small write (1024 bytes, under the 5 MB min part size) does not flush during write(), so the only uploadPart() happens at commit time; that upload is made to throw, and the test asserts no s3-part-* file is left behind. Fails without the fix.
  • uploadPartFailureFromWriteDeletesTempFile — drives the write()-path flush (a full 5 MB part) with a failing upload and asserts the temp file is cleaned up. Fails without the fix.
  • closeForCommitIsIdempotentWhenTempFileMissing — exercises the closeForCommit() else-branch (no pending bytes) with the temp file already gone. Throws NoSuchFileException without the fix; passes with deleteIfExists.
  • closeForCommitSuccessDeletesTempFile — happy-path control: a successful commit deletes the temp file. Passes with and without the fix.

With the fix, all 4 pass. Without the fix (source reverted to pristine master, tests kept), the 3 above fail (2 leak assertions + 1 NoSuchFileException) and the happy-path control passes. Also manually verified the happy path against real S3 (13 MB → genuine 3-part multipart upload, object committed, no local s3-part-* files left, no orphan multipart uploads).

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: yes

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?
  • Yes. Claude was used as a coding assistant; I reviewed and verified all changes.

…ailure and non-idempotent deletes in NativeS3RecoverableFsDataOutputStream

uploadCurrentPart() deleted the local temp file only after uploadPart()
returned, with no try/finally. This permanently leaks a temp file on the
closeForCommit() path: closeForCommit() sets closed = true before calling
uploadCurrentPart(), so when uploadPart() throws (e.g. an S3 5xx surfacing
after the SDK's own retries), the later close() short-circuits on its
"if (!closed)" guard and never reclaims the temp file. The single pending
part's s3-part-<uuid> file is then orphaned in the shared io.tmp.dirs (one
leaked file per affected stream) until the TaskManager is recycled. (On the
write() path a later close() does still reclaim the file, so there the missing
per-part cleanup was a robustness gap rather than a permanent leak.)

Wrap the upload in try/finally and delete the temp file in the finally, so it
is removed whether or not uploadPart() succeeds. Any delete failure in the
finally is caught and logged so it cannot mask the original upload IOException.

Also replace three Files.delete() calls with Files.deleteIfExists() so cleanup
is idempotent when the temp file is already gone. This closes a real (if minor)
TOCTOU in close(): write()/uploadCurrentPart() run without the lock and delete
currentTempFile before createNewTempFile() reassigns it, so a concurrent
cancellation close() could previously hit NoSuchFileException between its
exists() check and delete().

Add regression tests: closeForCommitUploadFailureDeletesTempFile (the permanent
commit-path leak) and uploadPartFailureFromWriteDeletesTempFile both assert no
s3-part-* file remains after an uploadPart() failure and fail without the fix;
closeForCommitIsIdempotentWhenTempFileMissing asserts closeForCommit() succeeds
when the temp file was already removed (threw NoSuchFileException before the
fix); closeForCommitSuccessDeletesTempFile guards the happy path.
@flinkbot

flinkbot commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@NestDream

Copy link
Copy Markdown
Author

@flinkbot run azure

@NestDream

Copy link
Copy Markdown
Author

CI is green (Azure build 75800). The change wraps uploadCurrentPart() in try/finally to fix a commit-time temp-file leak and converts the three Files.delete() sites to Files.deleteIfExists() for idempotent cleanup. There are four regression tests; three fail on unpatched master and one is a happy-path control.

@gaborgsomogyi when you have a moment, would you be able to take a look, or assign FLINK-39874 to me so this can be reviewed? CC @Samrat002 for native-s3 context. Happy to address any feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants