-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Optimize Directory.GetFiles by passing safe patterns to NtQueryDirectoryFile #122947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@copilot, there are build failures on windows: |
|
@copilot, there are lots of test failures on Windows, e.g. |
... Fixed in 546cea8. The issue was that patterns with invalid filename characters (like |
546cea8 to
6e4cf0a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes Directory.GetFiles with search patterns on Windows by passing "safe" patterns to NtQueryDirectoryFile as OS-level pre-filters, while maintaining correctness through the existing managed MatchesPattern filter. This converts certain O(N) operations back to O(log N) for patterns where NTFS can use B-tree seeking.
Key changes:
- Thread search expression through the enumeration infrastructure (
FileSystemEnumerable→FileSystemEnumerator) - Add
IsSafePatternForOSFilter()validation to determine which patterns can be safely passed to the OS - Pass safe patterns to
NtQueryDirectoryFileon the first call per directory, with managed filtering always validating results
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| FileSystemEnumerator.Windows.cs | Adds expression field, constructor overload, pattern safety validation method, and GetData() modifications to pass safe patterns to NtQueryDirectoryFile on first call per directory |
| FileSystemEnumerator.Unix.cs | Adds no-op constructor overload accepting expression parameter for cross-platform consistency |
| FileSystemEnumerable.cs | Adds expression field and threads it through to FileSystemEnumerator constructor |
| FileSystemEnumerableFactory.cs | Updates factory methods to pass expression parameter through to FileSystemEnumerable |
src/libraries/System.Private.CoreLib/src/System/IO/Enumeration/FileSystemEnumerator.Windows.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/Enumeration/FileSystemEnumerator.Windows.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/Enumeration/FileSystemEnumerator.Unix.cs
Show resolved
Hide resolved
657b3ab to
ca8957c
Compare
JeremyKuhne
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to conditionalize passing the filter you need to look at the passed in options to determine if you're accurately understanding "safe". There are different matching algorithms that treat "." differently on both platforms, and the case sensitivity defaults are different on both (and unchangeable, at least on Windows). The behavior of the OS APIs is not super clearly documented, so anything done predictively here is pretty risky.
Above and beyond that this still won't deal with the 8.3 autogenerated name inconsistency.
I don't recommend doing anything that isn't configurable and err on the side of opting in. Most normal scenarios don't struggle. I think the best answer is to have a new virtual on the Enumerator that allows optionally specifying the OS filter. You could also hang this off of the options, but it would require some amount of navigating for documentation so that people understand the ramifications. InitialOperatingSystemFilter or something?
Note that all files in directories are always enumerated and the data fully retrieved, the question is just how many results are copied into the buffer. As also noted, you have to get all of the results anyway when doing recursive searches.
Final note: putting hundreds of thousands of files in a directory is bad news from a performance perspective, whether or not we layer on additional buffer copying of file names.
@JeremyKuhne, with all due respect, we don't get to make that call for users. We made changes that by default significantly negatively impact existing performance; we need to do our best to rectify that in as many cases as possible. We can also suggest that folks would get better performance if they made additional changes, either to their code or to their configuration, but folks absolutely have such configurations (myself included).
What changes are necessary beyond the ones already in this PR? i.e. specifically how does https://github.com/dotnet/runtime/pull/122947/files#diff-fb61490eab3527efdb0c0e91297759b84da186a3a15af6f4420df57905160043R375-R381 need to be updated?
How so? The intent here is that 8.3 names may result in additional entries being yielded by the OS and those will then be filtered out by the existing managed filtering that's happening. Are there cases where that falls down?
I think we have to do something that tries without requiring code changes to mitigate the breaks we introduced. 1000x performance degredation (as cited in the original issue, and as we've heard in other issues over the last few years) is a break. |
I'm not trying to infer anything other than stating a fact that users should be aware of. It will hurt your performance to have massive directories. Every time you enumerate a directory you touch every directory entry to get your results.
Depending on the match type the filter is going to be different. The pattern needs to have been normalized into DOS_DOT, DOS_STAR, and DOS_QM before it gets used here as NtQueryDirectoryInfo needs the special escape characters to filter correctly.
Yeah, the double filter should help that issue. With that, however, you will be making performance worse in some cases by making this the default. Think of a folder with nothing but jpeg files in it when you pass a FYI- There is no way to see how many files are in a directory without enumerating the whole thing, at least on NTFS. I was hoping there might be to help address this thing in a more targeted way. |
The approach I've taken here is to simply not provide the filter when it would be problematic: since we're always filtering in managed code after each result anyway, any additional filtering that can be done by the OS is bonus. The more often we can propagate the filter, the more of a win we possibly get, but if there's any chance the filter will be problematic, we just don't propagate it. The occurrence of
I've created a local directory with 1000 files named "test0.jpg" through "test999.jpg". Then ran this benchmark before/after this change: using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkSwitcher.FromAssembly(typeof(Benchmarks).Assembly).Run(args);
[HideColumns("Job", "Error", "StdDev", "RatioSD")]
[MemoryDiagnoser(false)]
public class Benchmarks
{
[Benchmark]
[Arguments("*.*")] // always match everything
[Arguments("*.jpg")] // filtered but still matching everything
[Arguments("test12*")] // filtered and only matching a subset
public int Sum(string filter)
{
int i = 0;
foreach (var entry in Directory.EnumerateFiles(@"C:\coreclrtest\example", filter))
{
i += entry.Length;
}
return i;
}
}which for me yields:
|
That's the part I'm talking about- I don't think the filter is always transformed by the time it is passed through to the enumerator. It may still be in the Win32 format. It needs to be normalized, or you'll be getting a smaller set of results. Unit tests for the weird DOS style matching and the various match options would probably be a good idea. I have a hard time keeping the subtleties of it in my head. :) The wacky behavior is detailed here: runtime/src/libraries/System.Private.CoreLib/src/System/IO/Enumeration/FileSystemName.cs Lines 134 to 139 in f6874c8
FYI: The method there is literally the code from Windows transformed into C#. I think once you're sure that the filter has been transformed you won't need worry about excluding the period.
It would be good to look at directory sizes of 10K as well as that is a more normal "large" size. Checking the scenario of |
Description
On Windows,
Directory.GetFileswith search patterns is O(N) instead of O(log N) because .NET Core+ always passesnulltoNtQueryDirectoryFile'sFileNameparameter, enumerating all files and filtering in managed code.This PR passes "safe" patterns to
NtQueryDirectoryFileas OS-level pre-filters while maintaining 100% correctness via the existing managedMatchesPatternfilter.Safe patterns (optimized):
*.jpg,prefix*,*foo*,prefix*.ext- Only*and valid literal charactersUnsafe patterns (unchanged):
*.*- .NET treats as*, OS requires.foo*.- DOS_STAR transformation differs?- DOS_QM behavior differsImplementation:
FileSystemEnumerable→FileSystemEnumeratorIsSafePatternForOSFilter()to validate patterns (checks for?,*.*,*.endings, and invalid filename characters)NtQueryDirectoryFileon first call per directoryCustomer Impact
Customers with selective patterns in large directories experience severe performance degradation. Example:
Directory.GetFiles(path, "A14881*.jpg")in a directory with 140K files but 4 matches currently enumerates all 140K files. This PR reduces that to ~4-10 entries.Regression
No. This is not fixing a regression but rather a long-standing performance issue introduced when .NET Core eliminated the OS-level filter for behavioral consistency.
Testing
SafePatternsWorkCorrectlytest validating optimized patterns produce correct resultsSearchPatternInvalid_Coretests (26 tests) pass with invalid character handlingRisk
Low. The optimization is a pure hint - the managed filter always validates results regardless of OS filtering. Unsafe patterns (including those with invalid characters) use existing unoptimized path. The change is Windows-specific and only affects the first
NtQueryDirectoryFilecall per directory. Comprehensive validation ensures patterns with invalid filename characters (control chars,",<,>,|,:,\,/) are not passed to the OS, preserving existing error handling behavior.Original prompt
This pull request was created from Copilot chat.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.