Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Odd file behavior in virtual branches, perhaps after installing Windows Updates #70

Open
FeralSquid opened this issue Aug 19, 2021 · 13 comments

Comments

@FeralSquid
Copy link

Hello,

My organization has been using a ProjFS-based file syncing setup since late last year. Since around March of this year, the number of machines using this system has grown from ~400 to ~700 now. This is the primary way people get and interact with the files they work on constantly every day.

I give this background because, as of yesterday, I've now seen what appears to be a new and very odd issue for a 2nd time (on a different machine) within the last ~week.

Nothing about the code managing ProjFS or the backing store information has changed in any substantial way for a few months now (some minor inspection improvements is about it, nothing that impacts interactions with ProjFS).

The only potentially interesting data point I have is that a few days before the first case our IT group rolled out a series of Windows Updates. That said, in both cases I had confirmed that the users did NOT have any pending windows updates at the time (they had installed and rebooted since), as that can cause all kinds of odd behavior.

The updates installed were KB5004245, KB5003539, KB5004748, KB5004772. The last batch of windows updates were installed a month prior (before this issue was first reported).

In both cases, the odd behavior mysteriously went away after a system restart.

Because that appears to fix it, and because our users sometimes like to attempt to fix things themselves and not let us know, it is possible that there have been other cases of this that I'm just not aware of. That said, the timing of 2 cases within a week, and just after the updates, combined with the reboot-as-a-fix, seems a little suspicious.

Anyway, on to the odd behavior...

The user notices the issue when a ProjFS file fails to be read, and so causes some user application to fail to start (if it is an exe/dll/etc) or fail to load some content file.

This behavior, though, only impacts a subset of the ProjFS files under a given virtual root, while other virtual files work perfectly fine (I am able to trigger placeholder creation, hydration, PrjDeleteFile back to virtual and repeat with other files). In the first case, all ProjFS files in a given directory were impacted (there were some full files mixed in the directory that worked normally).
I'm not sure if it was all files in the same directory for the 2nd case as the user rebooted before I could test that case thoroughly.

In the first case, though, I did have time to try a handful of things on these badly behaving files, with odd results:

  • The files are listed in File Explorer
  • I was able to right-click on them and open file properties
  • Attempting to delete the file (hitting delete key in File Explorer), though, failed oddly with "Item Not Found" "This is no longer located at XXX"

...then, in C#, I tried:
var path = // bad path
var fi = new FileInfo(path);
// fi.Exists => false
// fi.Attributes => 0xFFFFFFFF

var di = new DirectoryInfo(Path.GetDirectoryName(path));
var file = di.EnumerateFiles().First(f => string.Compare(f.FullName, path) == 0);
// file is NOT NULL ... it was enumerated
// file.Exists => TRUE !!
// file.Attributes => 0x00440001 (RECALL_ON_DATA_ACCESS | RECALL_ON_OPEN | READONLY) -- looks like a placeholder, but no REPARSE_POINT attribute??

using (var fs = file.OpenRead())
{
fs.ReadByte();
}
// throws FileNotFoundException

FindFirstFileEx w/ FIND_FIRST_EX_ON_DISK_ENTRIES_ONLY lists no items in the directory !? (full files had been deleted at this point, leaving only the badly behaving files in the directory)
FindFirstFileEx without that flag lists the bad file with the above attributes

Calling CreateFile() directly to try and get any reparse point data also failed to invalid path.

I also tried:

  • Stopping and restarting the virtual root
  • Clearing the ProjFS negative path cache

... and re-running all of the above tests, all with the same result.

Throughout all of this, my ProjFS application logged no errors, and I suspect got not callbacks (though I can't fully prove that).

The file in question had been newly created on this machine the day before, then submitted to version control and the user had later sync'd. So, the file should have transitioned from Full-Writable -> Full-ReadOnly -> PrjDeleteFile to make it virtual -> Placeholder (app-triggered) and broken. All of this occurred a day or two after taking the windows updates.

I didn't notice any obvious critical or error system events in that time period.

And then, as a bit of a last ditch, I asked the user to reboot ... and after they did, the badly behaving placeholder files suddenly worked perfectly!

Rebooting also "magically" fixed the 2nd case.

Any ideas?
Any other things you'd want me to try if/when I see the next case?

Thanks!

@cgallred
Copy link
Contributor

I don't entirely know how to explain that behavior, especially that a reboot fixed things.

Because of the way the .NET System.IO file classes behave, that C# program unfortunately isn't telling you what you may think it is. System.IO.FileInfo doesn't open a file handle to determine whether a file "exists". Instead, it uses FindFirstFile to look for it in the parent directory. If the file appears in that enumeration, then it "exists". It does not open a handle to the file to confirm its presence on disk. The fact that your program shows you the RECALL_ON_OPEN bit in the file's attributes is a clue to this behavior. RECALL_ON_OPEN is a synthetic attribute bit that is set only for virtual files (the ProjFS file system filter inserts the bit for files that the provider returns in its enumeration). So the file "exists" in the sense that a directory enumeration shows it, but it is not a placeholder. BTW, even if it were a ProjFS placeholder you wouldn't see the REPARSE_POINT attribute bit in an enumeration; we hide the fact that there's a reparse point for app compat reasons. There are a lot of apps out there that make incorrect assumptions about what it means for a file to have a reparse point on it.

Your provider must have gotten directory enumeration callbacks, otherwise the C# app wouldn't have seen anything in the enumeration (since what it saw was a virtual file).

This also explains why FindFirstFileEx(...FIND_FIRST_EX_ON_DISK_ENTRIES_ONLY...) sees nothing, but without the ON_DISK flag it sees what the C# program did. With the flag no enumeration callbacks are sent to your provider.

So did your C# app use a bogus path for the file? That's what I understood from the comment "bad path" on the first line. If that's the case, then it seems like there could be a bug either in how we're sending directory enumeration callbacks, or there's a bug in your provider in how it services the callbacks. Somewhere along the line, a bad file path turned into something that your provider was able to enumerate (hence the virtual files returned for enumeration), but that it could not produce placeholder information for (hence the failure to actually open the file when you called fs.OpenRead()).

@FeralSquid
Copy link
Author

sorry, "bad path" was meant to mean the full path to the misbehaving file.
The path itself is a correct path to a virtual file.

I have noticed that .net File attributes returned are not always accurate/complete with virtual or placeholder files, which is why I was surprised that I got the same attributes returned when I enumerated with a FindFirstFileEx (without the ON_DISK flag). FindFirstFileEx that does list REPARSE_POINT as an attribute in normal situations when enumerating with the ON_DISK flag, at least, though I can't say I've tested it without the ON_DISK flag as I generally am doing it to avoid invoking callbacks.

If/when I see this again I will verify if I'm getting callbacks or not related to these misbehaving files.

@FeralSquid
Copy link
Author

ok, we just had another case and I was able to remote to the machine and debug my host app while a directory was in a bad state.

In this case there was a single directory that was behaving oddly, and was slightly different from above, so may not be exactly the same issue ... but has a lot of similarities.

Here is what I observed in this case:

  • Misbehaving path (directory): "\bin\online_Release\deployment"
  • Path listed subdirectories in File Explorer / enumeration, but no files.
    • Subdirectories were placeholders on disk
    • Should have listed ~15 virtual files, but was not
  • Parent directory worked correctly: "\bin\online_Release"
  • Child directory worked correctly: "\bin\online_Release\deployment"
  • User noticed problem when trying to copy files INTO this directory (from a non-virtual or working virtual source path), which gave him 'Virtualization provider not available' error (even though the provider was running, and the parent and child directories were working!)
  • Debugging my host application on the machine with the issue, I would get StartDirectoryEnumerationCallback() called for the parent directory (\bin\online_Release) AND child directories (\bin\online_Release\deployment\powershell), but NOT for the misbehaving directory (\bin\online_Release\deployment)
  • Comparing the reparse data from a working directory to the non-working directory, I didn't see anything obviously wrong/different (though I don't fully know what all the bits in there mean)
  • Enabling logging for all notifications, I noticed an odd thing also ... other nearby (working) directories would log NotifyFileOpenedCallback followed soon by a NotifyFileHandleClosedNoModificationCallback from System ... but this misbehaving directory only had NotifyFileOpenedCallbacks ... no other notifications were logged about it in the time I monitored it (a while, including navigating around and refreshing in File Explorer)
  • I PrjDeleteFile'd (made virtual) all files and subdirectories under the bad path so that I could PrjDeleteFile it. After that, refreshing the directory in File Explorer DID trigger a callback in my application and the directory worked was then working properly.

I did dump the reparse data again to compare now that it was working and the thing that changed were the 16 bytes before the ProviderId, but that appeared to have changed to a new, consistent value for all directory placeholders created with this instance of my host app.

It seems like there must be some state that is going wrong in how directories are mapped to callback host applications (especially given the 'virtualization provider not available' errors for this directory while all the surrounding directories were under the same one virtual root mapped to my one provider app instance)?

I would think that it was in the directory reparse data, but the fact that the other similar issues were resolved with a reboot only makes it seem like there must be some in-memory mapping that is getting mucked up somehow (and nuking the directory placeholder, as we did in this case, also resets that state similar to a reboot)?

Thanks for you help!

@abhijeet-gautam
Copy link
Contributor

abhijeet-gautam commented Aug 31, 2021

If you still have the reparseData dumps would you mind sharing those (please obscure the data that shouldn't be visible) ? I am hoping you did see a ProjFs reparseTag for the misbehaving directory \bin\online_Release\deployment as well.

@FeralSquid
Copy link
Author

Unfortunately I don't have the raw reparse data dumps, but as part of my tests I did look at them and compare the misbehaving directory's reparse data to its working parent directory's reparse data, and they were identical except for the path strings (and reparse data size, because of the string length difference).

I will attach the raw dump if/when we see another case.

@fanzeyi
Copy link

fanzeyi commented Sep 10, 2021

Hi all, I'd like to report a similar case we have been seeing with our project (https://github.com/facebookexperimental/eden).

We have received several user reports where files listed in ProjectedFS is not accessible despite the file itself is available. The file will appear in file explorer as well as directory listing in PowerShell. I confirmed the file in question was included in our reply to GetDirectoryEnumerationCallback.

However, any attempt to directly access the file will result as a file not found error from Windows. For example, a Python tool we have to check ProjectedFS state via ctypes will report FileNotFoundError: [WinError -2147024894] The system cannot find the file specified. Other debugging tools will report similar errors. We are not able to query the reparse data for the file as fsutil reparsepoint query will report Error: The system cannot find the file specified..

We also do not observe any getPlaceholderInfo calls to the file in question in our provider even though we did see getPlaceholderInfo calls for the variant of the same file (filename.exe, filename.lnk and filename.exe.lnk). I believe these calls are fallback path of some code trigger by the initial file not found error.

The machine seeing is issue is also on the latest Windows version (21H1). The only pending patch not installed is kb4023057.

Similar to OP, we tested both restarting the virutalization provider and clearing the negative path cache but no avail. The only workaround we have found so far is to reboot the machine.

@abhijeet-gautam
Copy link
Contributor

abhijeet-gautam commented Sep 11, 2021

@fanzeyi thanks for reporting the issue. If you still have the repro would you mind capturing and sharing ProjFs logs ? If yes,

  1. please copy the following into a .cmd file and run from an admin cmd prompt.
  2. Repro the scenario - where it reports the file not found error.
  3. Stop the .cmd script by pressing any key.
  4. A file projfstrace_000001.etl would be generated at the path you ran the script in 1.
  5. Please send the file to us for investigation.
@echo off
logman create trace -n ProjFS_ALL -o projfstrace.etl -nb 128 640 -bs 128
logman update trace -n ProjFS_ALL -p {b89e90a5-b619-50dd-ac92-a91115f73059} 0x7FFFFFFF 255
logman update trace -n ProjFS_ALL -p {B6D7DC51-78CF-4E85-8BAC-488A9F47A0BB} 0x7FFFFFFF 255
logman update trace -n ProjFS_ALL -p {baca0e64-1bd1-506a-a250-a10ab355fbd2} 0x7FFFFFFF 255
logman update trace -n ProjFS_ALL -p {32F63487-712D-4324-A29F-CBF4940C8C0C} 0x7FFFFFFF 255
logman start -n ProjFS_ALL

echo Trace started.  You may now run your scenario.
echo When complete, hit any key in this window to stop the scenario.
pause

logman stop -n ProjFS_ALL
logman delete -n ProjFS_ALL

@FeralSquid you could help us with the traces as well.

@xavierd
Copy link

xavierd commented Sep 13, 2021

Adding some context. Looking back at the internal EdenFS user support, it looks like the first time this issue was reported was in April this year, but some users may have hit the issue sooner than that.

@xavierd
Copy link

xavierd commented Sep 15, 2021

One of our user just hit what appears to be this issue.

At first, listing files in a directory under the virtualization root showed all but one file. Thinking that this might be an EdenFS bug, they then restarted EdenFS, but trying to open the file afterwards gave them the "The provider that supports file system virtualization is temporarily unavailable". Creating a file in that directory also failed with the same error, which appears oddly similar to @FeralSquid's comment from above.

The reparse point on the directory appears to be in order:

PS C:\open\fbsource> fsutil reparsepoint query C:\path\to\directory 
Reparse Tag Value : 0x9000001c
Tag value: Microsoft
Tag value: Directory

Reparse Data Length: 0x198
Reparse Data:
0000:  02 00 00 00 02 00 00 00  c6 c5 ce d1 c9 d7 b6 43  ...............C
0010:  98 f1 ca dd e1 6a 5a 2e  00 00 00 00 00 00 00 00  .....jZ.........
0020:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0030:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0040:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0050:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0060:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0070:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0080:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0090:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00a0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00b0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00c0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00d0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00e0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00f0:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
0100:  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
[... path omitted ...]

Querying the file in question fails:

PS C:\open\fbsource> fsutil reparsepoint query C:\path\to\directory\BUCK
Error:  The provider that supports file system virtualization is temporarily unavailable.

I've pointed the user to this issue and they collected the trace requested above. What's the best way to send you this trace file?

@cgallred
Copy link
Contributor

How big is the trace file?

@xavierd
Copy link

xavierd commented Sep 15, 2021

The zipped version is ~20KB, but I'm more concerned about potential private information contained in it to post it on Github :)

@cgallred
Copy link
Contributor

Oh, I don't want you to post it to GitHub in any case. :-) I was thinking along the lines of email vs. some file share service.

Please attach it to an email and send it to msfltdev (at) microsoft (dot) com.

@FeralSquid
Copy link
Author

Hello all,

I was just able to grab the logs from a new live repro case. I've emailed them to the address above.

Any luck finding something in xavierd's logs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants