[Bug](fs) DFSFileSystem phantom cleanup may close Hadoop FileSystem while operations are still running #64369

@foxtail463

Search before asking

  • I have searched the existing issues and found no similar ones.

Version

branch-4.0

What's Wrong?

In branch-4.0, RemoteFSPhantomManager tracks DFSFileSystem with a PhantomReference and closes the internal org.apache.hadoop.fs.FileSystem when the DFSFileSystem object is garbage collected.

This can close the underlying Hadoop FileSystem too early. Some DFS operations may still be using the Hadoop FileSystem, RemoteIterator, FSDataInputStream, or FSDataOutputStream after the owning DFSFileSystem object is no longer strongly reachable. In that window, GC can enqueue the DFSFileSystem phantom reference, and the cleanup thread may close the Hadoop FileSystem while operations such as listStatus/listFiles, read, or write are still running.

This may cause intermittent failures in external file system access, especially when:

  • DFSFileSystem is evicted from FileSystemCache by size limit
  • DFSFileSystem expires after access timeout
  • DFSFileSystem is created directly and not retained by the cache
  • a long-running list/read/write operation only keeps the underlying Hadoop objects alive

What You Expected?

The internal Hadoop FileSystem should not be closed while any active DFS operation is still using it.

Cleanup should happen only after both conditions are true:

  1. The owning DFSFileSystem is closed or garbage collected.
  2. There are no active operations/streams/iterators holding the underlying Hadoop FileSystem.

How to Reproduce?

The issue is timing-sensitive and depends on GC.

General reproduction scenario:

  1. Create or obtain a DFSFileSystem.
  2. Start a long-running HDFS operation, for example recursive listFiles/listStatus, or create an input/output stream.
  3. Drop the strong reference to the DFSFileSystem while the returned Hadoop object is still being used.
  4. Trigger GC.
  5. Wait for RemoteFSPhantomManager cleanup.
  6. Continue using the iterator/stream.

The cleanup thread may close the internal Hadoop FileSystem, causing the still-running operation to fail with IO errors related to a closed filesystem/client.
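The steps above can be modelled without a real HDFS cluster. The following is a minimal sketch, not the actual Doris code: `FakeClient`, `Owner`, `OwnerRef`, `PhantomCleaner`, and `ReproSketch` are hypothetical stand-ins for the Hadoop FileSystem, DFSFileSystem, and RemoteFSPhantomManager, and `Reference.enqueue()` is called by hand to simulate the GC step so that the outcome is deterministic.

```java
import java.io.Closeable;
import java.io.IOException;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

// Hypothetical stand-in for the Hadoop FileSystem: reads fail once it is closed.
final class FakeClient implements Closeable {
    private volatile boolean closed;

    String read() throws IOException {
        if (closed) {
            throw new IOException("Filesystem closed");
        }
        return "data";
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Hypothetical stand-in for DFSFileSystem: its reachability drives the cleanup.
final class Owner {
    final FakeClient client = new FakeClient();
}

// Phantom reference that keeps the client so it can be closed after the owner is gone.
final class OwnerRef extends PhantomReference<Owner> {
    final FakeClient client;

    OwnerRef(Owner owner, ReferenceQueue<Owner> queue) {
        super(owner, queue);
        this.client = owner.client;
    }
}

// Hypothetical stand-in for the cleanup manager.
final class PhantomCleaner {
    private final ReferenceQueue<Owner> queue = new ReferenceQueue<>();

    OwnerRef track(Owner owner) {
        return new OwnerRef(owner, queue);
    }

    // One pass of the cleanup loop: close the client of every enqueued reference.
    int drain() {
        int cleaned = 0;
        Reference<? extends Owner> ref;
        while ((ref = queue.poll()) != null) {
            ((OwnerRef) ref).client.close();
            cleaned++;
        }
        return cleaned;
    }
}

final class ReproSketch {
    // Returns true if the in-flight operation failed after cleanup ran.
    static boolean reproduce() {
        PhantomCleaner cleaner = new PhantomCleaner();
        Owner owner = new Owner();           // 1. create the owner
        FakeClient inUse = owner.client;     // 2. an operation holds the underlying client
        OwnerRef ref = cleaner.track(owner);
        owner = null;                        // 3. drop the strong reference to the owner
        ref.enqueue();                       // 4. simulate GC enqueuing the phantom reference
        cleaner.drain();                     // 5. cleanup closes the client
        try {
            inUse.read();                    // 6. continue using the client
            return false;
        } catch (IOException e) {
            return true;
        }
    }
}
```

In this model `ReproSketch.reproduce()` reports that the read fails, because the cleanup pass only looks at the owner's reachability and never checks whether the client is still in use. With a real GC the enqueue in step 4 happens at an unpredictable time, which is why the actual failure is intermittent.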

This is less likely during normal cache hits because FileSystemCache uses a strong-value Caffeine cache, but the race is still possible after cache eviction, expiration, or direct construction outside the cache.

Anything Else?

The root cause is that phantom cleanup is tied to the lifetime of DFSFileSystem, but the actual resource users are the underlying Hadoop FileSystem and streams/iterators derived from it.

A safer design is to introduce a resource/lease layer:

  • DFSFileSystemResource owns the Hadoop FileSystem.
  • Each operation acquires a lease before using the resource.
  • close() or phantom cleanup only marks the resource as closing.
  • The Hadoop FileSystem is physically closed only after all active leases are released.

This avoids closing the Hadoop FileSystem while listFiles, reads, or writes are still in progress.
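A rough sketch of what such a lease layer could look like, under stated assumptions: `LeasedResource`, `Lease`, `acquire()`, and `markClosing()` are illustrative names that do not exist in the codebase, and a generic `Closeable` stands in for the Hadoop FileSystem. It is meant to show the reference-counting idea only, not a final design.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical resource wrapper: owns the delegate and defers the physical
// close until no lease is active.
final class LeasedResource<T extends Closeable> {
    private final T delegate;
    private int activeLeases;
    private boolean closing;
    private boolean closeStarted;

    LeasedResource(T delegate) {
        this.delegate = delegate;
    }

    // Each operation acquires a lease first; new leases are refused once closing.
    synchronized Lease acquire() throws IOException {
        if (closing) {
            throw new IOException("resource is closing");
        }
        activeLeases++;
        return new Lease();
    }

    // Called from close() or from phantom cleanup: marks the resource as closing
    // and closes the delegate right away only if no lease is active.
    void markClosing() throws IOException {
        boolean closeNow;
        synchronized (this) {
            closing = true;
            closeNow = activeLeases == 0 && !closeStarted;
            closeStarted |= closeNow;
        }
        if (closeNow) {
            delegate.close();
        }
    }

    final class Lease implements AutoCloseable {
        private boolean released;

        T resource() {
            return delegate;
        }

        // Releasing the last lease of a closing resource triggers the physical close.
        @Override
        public void close() throws IOException {
            boolean closeNow;
            synchronized (LeasedResource.this) {
                if (released) {
                    return; // releasing twice is a no-op
                }
                released = true;
                activeLeases--;
                closeNow = closing && activeLeases == 0 && !closeStarted;
                closeStarted |= closeNow;
            }
            if (closeNow) {
                delegate.close();
            }
        }
    }
}
```

An operation would wrap its work in `try (LeasedResource<...>.Lease lease = resource.acquire()) { ... }`. For streams and iterators returned to the caller, the lease would have to stay open until the stream is closed or the iterator is exhausted, for example by having a wrapper object release it.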

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
