Search before asking
Version
branch-4.0
What's Wrong?
In branch-4.0, RemoteFSPhantomManager tracks DFSFileSystem with a PhantomReference and closes the internal org.apache.hadoop.fs.FileSystem when the DFSFileSystem object is garbage collected.
This can close the underlying Hadoop FileSystem too early. Some DFS operations may still be using the Hadoop FileSystem, RemoteIterator, FSDataInputStream, or FSDataOutputStream after the owning DFSFileSystem object is no longer strongly reachable. In that window, GC can enqueue the DFSFileSystem phantom reference, and the cleanup thread may close the Hadoop FileSystem while operations such as listStatus/listFiles, read, or write are still running.
This may cause intermittent failures in external file system access, especially when:
- DFSFileSystem is evicted from FileSystemCache by size limit
- DFSFileSystem expires after access timeout
- DFSFileSystem is created directly and not retained by the cache
- A long-running list/read/write operation only keeps the underlying Hadoop objects alive
What You Expected?
The internal Hadoop FileSystem should not be closed while any active DFS operation is still using it.
Cleanup should happen only after both conditions are true:
- The owning DFSFileSystem is closed or garbage collected.
- There are no active operations/streams/iterators holding the underlying Hadoop FileSystem.
How to Reproduce?
The issue is timing-sensitive and depends on GC.
General reproduction scenario:
- Create or obtain a DFSFileSystem.
- Start a long-running HDFS operation, for example recursive listFiles/listStatus, or create an input/output stream.
- Drop the strong reference to the DFSFileSystem while the returned Hadoop object is still being used.
- Trigger GC.
- Wait for RemoteFSPhantomManager cleanup.
- Continue using the iterator/stream.
The cleanup thread may close the internal Hadoop FileSystem, causing the still-running operation to fail with IO errors related to a closed filesystem/client.
This is less likely during normal cache hits because FileSystemCache uses a strong-value Caffeine cache, but the race is still possible after cache eviction, expiration, or direct construction outside the cache.
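Because the real race depends on GC timing, the failure mode can be illustrated with a deterministic stand-in. The sketch below is not Doris or Hadoop code: FakeHadoopFs, FakeDfsFileSystem, and EarlyCloseDemo are hypothetical names, and the explicit inner.close() call stands in for the GC plus the cleanup thread. It only shows that an iterator obtained from the inner object breaks once cleanup runs, even though the caller never closed anything.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the Hadoop FileSystem: listing fails after close(). */
final class FakeHadoopFs implements Closeable {
    private volatile boolean closed;

    /** Returns a lazy iterator that checks the closed flag on every step. */
    Iterator<String> listFiles(List<String> paths) {
        Iterator<String> it = paths.iterator();
        return new Iterator<String>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public String next() {
                if (closed) {
                    throw new UncheckedIOException(
                            new IOException("simulated: filesystem closed"));
                }
                return it.next();
            }
        };
    }

    @Override public void close() { closed = true; }
}

/** Hypothetical stand-in for DFSFileSystem: a thin wrapper around the inner object. */
final class FakeDfsFileSystem {
    final FakeHadoopFs inner = new FakeHadoopFs();
}

final class EarlyCloseDemo {
    static String run() {
        FakeDfsFileSystem dfs = new FakeDfsFileSystem();
        FakeHadoopFs inner = dfs.inner;   // what the phantom cleanup holds on to
        Iterator<String> it = dfs.inner.listFiles(List.of("/a", "/b"));
        String first = it.next();         // the listing is in progress

        dfs = null;                       // wrapper is no longer strongly reachable
        inner.close();                    // stands in for GC + cleanup thread

        try {
            it.next();                    // the caller continues the listing
            return "unexpected success";
        } catch (UncheckedIOException e) {
            return first + " then " + e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```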
Anything Else?
The root cause is that phantom cleanup is tied to the lifetime of DFSFileSystem, but the actual resource users are the underlying Hadoop FileSystem and streams/iterators derived from it.
A safer design is to introduce a resource/lease layer:
- DFSFileSystemResource owns the Hadoop FileSystem.
- Each operation acquires a lease before using the resource.
- close() or phantom cleanup only marks the resource as closing.
- The Hadoop FileSystem is physically closed only after all active leases are released.
This avoids closing the Hadoop FileSystem while listFiles, reads, or writes are still in progress.
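A minimal sketch of the lease idea, to make the proposal concrete. This is not the actual patch: LeasedResource, Lease, acquire(), and markClosing() are hypothetical names, and a plain Closeable stands in for the Hadoop FileSystem. It assumes a simple counter guarded by the object monitor; a real implementation would also need leases held by returned streams and iterators until they are closed or exhausted.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: owns a Closeable that stands in for the Hadoop FileSystem. */
final class LeasedResource {
    private final Closeable delegate;
    private int activeLeases = 0;     // guarded by this
    private boolean closing = false;  // guarded by this
    private boolean closed = false;   // guarded by this

    LeasedResource(Closeable delegate) { this.delegate = delegate; }

    /** Each operation acquires a lease first; new leases are refused once closing. */
    synchronized Lease acquire() throws IOException {
        if (closing) {
            throw new IOException("resource is closing");
        }
        activeLeases++;
        return new Lease();
    }

    /** Called from close() or phantom cleanup: only marks closing, closes if idle. */
    void markClosing() throws IOException {
        boolean closeNow;
        synchronized (this) {
            closing = true;
            closeNow = activeLeases == 0 && !closed;
            if (closeNow) closed = true;
        }
        if (closeNow) delegate.close();   // physical close outside the lock
    }

    private void release() throws IOException {
        boolean closeNow;
        synchronized (this) {
            activeLeases--;
            closeNow = closing && activeLeases == 0 && !closed;
            if (closeNow) closed = true;
        }
        if (closeNow) delegate.close();   // the last lease performs the deferred close
    }

    /** Handle held for the duration of one operation, stream, or iterator. */
    final class Lease implements Closeable {
        private final AtomicBoolean released = new AtomicBoolean();

        Closeable resource() { return delegate; }

        @Override public void close() throws IOException {
            if (released.compareAndSet(false, true)) release();  // idempotent
        }
    }
}

final class LeaseDemo {
    public static void main(String[] args) throws IOException {
        AtomicBoolean fsClosed = new AtomicBoolean();
        LeasedResource res = new LeasedResource(() -> fsClosed.set(true));

        LeasedResource.Lease lease = res.acquire();  // e.g. a listFiles in progress
        res.markClosing();                           // owner closed or collected
        System.out.println("closed while leased: " + fsClosed.get());   // false
        lease.close();                               // operation finishes
        System.out.println("closed after release: " + fsClosed.get());  // true
    }
}
```

Performing the physical close outside the synchronized block keeps a slow delegate.close() from blocking other threads that are releasing leases.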
Are you willing to submit PR?
Code of Conduct