
Not suitable for FSLogix CCD #45

Open
antonvdl opened this issue Aug 24, 2021 · 28 comments

Comments

@antonvdl

The shrink script works great; however when using CCD we see some issues.
The script will shrink the VHD on the primary location; everything seems fine.
However, FSLogix is not aware of any changes in the VHD since the meta file is not updated.

So FSLogix thinks that the VHD on the primary and the secondary location are equal and continues to write changes to both locations. This will result in corruption in the VHD on the secondary location.

Running the script on both the primary and secondary location will not resolve this issue, since the shrink operation gives a different result (at the block level) on every run.

I think the best solution here would be to update the VHD.Meta file so FSLogix knows about the update.

@lordjeb
Member

lordjeb commented Aug 24, 2021

There is an additional wrinkle on this one. There may be unwritten data in a cache location on a local machine that still has to be flushed to the storage locations. Any changes on the storage locations would cause problems, and might result in lost cache data.

Additionally, updating the meta file would not be enough, because the vhd file has been changed. So the meta file just keeps some data about what was flushed and when. In this case, a full resync of the updated vhd to the secondary location is going to be necessary.

The only other possibility I could think of would be to mount the vhd file through a machine running ccd and do the shrink operation there, allowing ccd to do its thing and sync all the data to multiple storage locations. Note that this would also handle potential locking issues that would not be handled otherwise.

@antonvdl
Author

In my understanding:

As long as the vhd is not locked, there should be no local cache data, right?

(In our case we delete the local cache on logoff, but I believe this is also true for other environments)

For the meta file: as soon as FSLogix sees that there is a difference in the meta files, the newest one takes precedence.

So the vhd in the secondary location will be automatically replaced by a copy of the vhd on the primary location.

I can’t find much information about this meta file, and I can’t read it. What I have seen is that the file only gets updated from a user session.

When I edit a VHD with frx, the meta file remains unchanged (and my changes won’t be synced to the second location).

@JimMoyle
Contributor

You can safely shrink both locations separately; the FSLogix agent doesn't recognise this as a change to sync over to the other side, as nothing inside the VHD has changed. I'm adding an enhancement to not run when there is a RW disk present, as that does have bad consequences, so for now run it only during a maintenance window.

@lordjeb I've tested the above previously and it works fine, let me know if you want to have a chat about it though.

@lordjeb
Member

lordjeb commented Aug 24, 2021

@JimMoyle that's awesome, glad to know it works!

@antonvdl
Author

Hi @JimMoyle,

We are not using RW disks, but we do see issues after shrinking.
The issues occur after the shrink when the user starts to write new data.
New data will be synced correctly, but for old data the file table is incorrect at the secondary location.

After some writes and deletes, the secondary location becomes corrupted.

@JimMoyle
Contributor

@antonvdl That's interesting, I'll test some more, but I have not heard of any other instances of this. How many disks did this happen to at the time?

@antonvdl
Author

Hi Jim,

I can easily reproduce this with a fresh VHD:

  • Remove existing VHD on both locations
  • Start user session
    • New VHDs are created
    • Fill up VHD with random data to +/- 3GB
    • Remove +/- 1GB of data
    • Log off
  • Confirm that both VHDs are equal by calculating filehash
  • Shrink VHD on primary location
  • Confirm that the VHD on the primary location is smaller than on the secondary location
  • Start user session
    • Fill up VHD with random data to +/- 3GB
    • Remove +/- 1GB of data
    • Log off
  • Confirm that both VHDs have the same size but filehash differs
  • Mount secondary VHD in readonly -> File or directory is corrupted

Executing the same steps without the shrink operation will not give corruption.

I think FSLogix syncs the file table of the VHDs,
but the file table is different because of the optimize.
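The file-hash confirmation steps above can be scripted. A minimal sketch in Python (used here only for illustration; the paths and function names are hypothetical, and on Windows `Get-FileHash` in PowerShell does the same job):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a multi-GB VHD never loads into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def vhds_match(primary: str, secondary: str) -> bool:
    """True when the two VHD copies are byte-identical."""
    return file_sha256(primary) == file_sha256(secondary)
```

In the repro above, `vhds_match` should return True before the shrink and False afterwards, even once both files are back to the same size.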

Another example to show this behaviour (copying filetables) is by manually changing the primary VHD:

  • Make sure there is a healthy VHD in both locations
  • Make sure the VHD is not in use
  • Mount the primary VHD locally and add a file to the profile folder (ProcessExplorer.zip in this test)
  • Start user session
    • Locate the added file and create a textfile next to it (to update the filetable)
    • Confirm the ZIP file can be opened
    • Log off
  • Confirm there is a difference between primary and secondary VHD by comparing filehash
  • Remove the primary VHD so that we fail over to secondary location
  • Start user session
    • Locate the added file
    • Try to open the ZIP file: This will fail
    • Log off
  • Mount the VHD and run a chkdsk; chkdsk will find errors in the disk

It looks like the FSLogix agent is synchronizing the VHDs at the block level and not at the file level.
Because we run a process outside the FSLogix environment, the block structure is changed and the secondary VHD becomes corrupted.
If the process ran inside the FSLogix environment, the changes would be equal on both VHDs.

When FSLogix knows that there is a difference between the VHDs (based on the meta file), it will discard the oldest VHD and replace it with the newest one.

@lordjeb
Member

lordjeb commented Aug 26, 2021

@antonvdl Your takeaway is correct. The FSLogix CCD feature synchronizes changes at the block level. It uses the meta file (the format of which is proprietary) to determine which sequenced changes have been committed to a storage location, so it can know which of multiple disks is the most up to date and which to use as the source of truth if others are less up to date.

I think some valid workarounds would be to shrink the disk on location 1 and delete the disk on location 2, or to shrink location 1 and copy the shrunk vhd to location 2. Both of these replicate what the FSLogix agent will do once it determines one of your vhd copies is out of date anyway (a full copy of the vhd), so it seems like a minimal-downside change.
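Both workarounds can be sketched in a few lines. This is a hypothetical helper in Python (the real shrink script is PowerShell); `sync_after_shrink`, the paths, and the assumption that the `.meta` file sits next to the VHD with a `.meta` suffix appended are all illustrative, not part of any documented FSLogix API:

```python
import shutil
from pathlib import Path

def sync_after_shrink(primary_vhd: Path, secondary_vhd: Path, copy: bool = True) -> None:
    """Make the secondary CCD copy consistent again after shrinking the primary.

    copy=True  overwrites the secondary with the shrunk primary (a full copy,
               like the agent would do for an out-of-date location);
    copy=False deletes the secondary VHD and its .meta (assumed to sit next
               to the VHD) so FSLogix recreates it at the next logon.
    """
    meta = secondary_vhd.with_name(secondary_vhd.name + ".meta")
    if copy:
        shutil.copy2(primary_vhd, secondary_vhd)
    else:
        secondary_vhd.unlink(missing_ok=True)
        meta.unlink(missing_ok=True)
```

Either branch replaces the secondary with a full copy sooner or later, so the trade-off is only when the network traffic happens: during the maintenance window (copy) or at the user's next logon (delete).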

@antonvdl
Author

@lordjeb Thanks for your reply. I haven’t seen the FSLogix agent notice an out-of-date vhd when the vhd was altered outside a user session.

Do you know when this happens? Maybe I did not wait long enough.

@lordjeb
Member

lordjeb commented Aug 27, 2021

@antonvdl During a login, FSLogix will detect that a storage location is out of date based on the .meta file. It looks at information inside that file to know which changes have been flushed to which storage locations. So it won't detect this based on the timestamps on the vhd file or anything like that.

@antonvdl
Author

@lordjeb Yes, I observed this behaviour.
But when we alter the VHD outside the FSLogix agent's scope (like with this script), the FSLogix agent will not detect the changes.

So we would need to apply the workaround of either deleting the disk on location 2 or syncing location 1 to location 2 after a shrink operation.

@antonvdl
Author

antonvdl commented Sep 2, 2021

@antonvdl That's interesting, I'll test some more, but I have not heard of any other instances of this. How many disks did this happen to at the time?

Hi @JimMoyle; have you been able to replicate the issue?
I think this is a widespread issue where people are unaware of corrupted profiles in the second location.
As long as you keep using the primary location you won't notice the issue.

@StevenM79

@JimMoyle @antonvdl I can confirm that disk maintenance is not suitable for Cloud Cache disks, as it will eventually corrupt the disk at the secondary location.

Also, if disk optimization is performed at the secondary location, the optimizations are lost when the disk at the primary location is written to. FSLogix does not see that the disk at the secondary location has been modified and therefore will not resync from the primary location.

How can we do correct disk maintenance to reclaim white-space? Without it we have approx. 200TB of data; disk maintenance would reduce this to only 48TB. But because of the disk corruption this doesn't seem to be a valid approach.

@antonvdl how did you resolve this?

@antonvdl
Author

@StevenM79 Unfortunately we didn't find a good solution within FSLogix for this issue.
From our storage platform we can use dedup and compression, so we win back a bit of the white-space.

Another solution would be to write a wrapper around the shrink script that deletes the files on the secondary site. At the next login the secondary site will be recovered. However, this approach has two issues:

  • As long as the user does not log in, there is no secondary location after the shrink, so the profile is lost if the primary location goes offline
  • When you remove a lot of VHDs from the secondary site and all users log in at the same time, the large sync from the primary to the secondary site may cause impact.

Maybe you can remove only the metadata file on the secondary site to resolve the first issue, but I didn't test that scenario.

@StevenM79

@antonvdl Thanks for the update. I'm testing the removal of the metadata files at the moment, will post the results here.

We found that dedup on storage wins back some space; however, optimizing disks would win back even more. Another option could be to copy the disk to the other storage location after a shrink, but this generates a lot of network traffic and disk I/O :(

@StevenM79

@antonvdl I have tested removing the meta file on all CCD locations after optimizing the disk on the secondary CCD location. Unfortunately FSLogix does not detect that the disks on both CCD locations are different. So same problem, unfortunately.

Guess we will have to go the dedup-on-storage-level route. In our case we will need a whopping 400TB per CCD location minimum. What are the dedup ratios in your experience?

@antonvdl
Author

@StevenM79 What happens if you optimize the disk on the primary CCD location and then remove the meta file on the secondary?

The current dedup ratio is 3:1

@StevenM79

@antonvdl Removing the meta file on the secondary CCD seems to trigger a resync at logon from primary to secondary. Will test this some more, including the scenario where the primary is unavailable and the secondary has no meta file, etc. The downside is a lot of extra resync traffic during the logon storm in the morning. Not sure if I'm happy about that.

@StevenM79

@antonvdl The CCD location where the meta file is removed will be seen as out of date at user logon. This will cause a resync from the CCD location which still has a meta file. In case of an outage before a resync was triggered, the system will connect to the CCD location where the meta file has been removed and create a new meta file. The user will work without problems. At the next logon this will be the CCD location with the newer meta file, causing a sync.

So it works correctly if you remove the meta file on the CCD location where you did not perform disk maintenance. However, this means that a lot of disks will be resynced at user logon in the morning.
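The meta-file removal on the non-maintained location can be sketched as follows. A hypothetical Python helper, assuming the `.meta` files sit alongside the VHDs inside the CCD location (the function name and layout are illustrative only):

```python
from pathlib import Path

def remove_meta_files(location: Path) -> list[Path]:
    """Delete the .meta files under a CCD location (recursively), so FSLogix
    treats this location as out of date and resyncs it from the other CCD
    location at the user's next logon."""
    removed = []
    for meta in sorted(location.rglob("*.meta")):
        meta.unlink()
        removed.append(meta)
    return removed
```

Run against the location where the shrink was NOT performed, after maintenance on the other location; the VHDs themselves are left in place so the resync, not this script, replaces them.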

I think this is unacceptable in a large environment, so I decided to further investigate the dedup-on-storage-level scenario. It still gives me headaches, as we have a share limit of 256TB. With 2 storage locations, 20000 disks and expected growth to 20GB per disk, I need a lot of shares and disk space.

Working in the cloud with O365 requires a lot of local storage for the FSLogix caching solution; I have a hard time explaining this to our management. They are under the illusion that they don't need local resources when working with cloud solutions...

@antonvdl
Author

@StevenM79 Good to hear that this scenario works.
The resync the next morning is indeed a big issue. You could edit the script to limit the number of VHDs that get shrunk in the same window. That should limit the effect the next morning, but it makes everything more complex again.
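Such a batching limit could look like this. A hypothetical sketch (the real shrink script is PowerShell; the helper name, glob pattern, and oldest-first policy are illustrative assumptions):

```python
from pathlib import Path

def pick_shrink_batch(location: Path, batch_size: int = 50) -> list[Path]:
    """Select at most batch_size VHD(X) files for this maintenance window,
    oldest-modified first, so only that many profiles trigger a resync the
    next morning. The .meta files are excluded by the suffix filter."""
    vhds = [p for p in location.rglob("*")
            if p.suffix.lower() in (".vhd", ".vhdx")]
    vhds.sort(key=lambda p: p.stat().st_mtime)
    return vhds[:batch_size]
```

Files skipped this window are the oldest next time, so every disk is eventually shrunk while the per-morning resync load stays bounded.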

@vikrant003

Hi, does anyone know if the shrink script is suitable for CC? Do you see much profile corruption with or without running shrink? CC has become a PITA for some time. I didn't see any comments from Jim since August last year on this thread, so I'm wondering if it was really addressed. One more thing: the experience you all share here is very helpful. Thanks much!

@StevenM79

@vikrant003 I decided not to use the shrink script with CC. The main reason is that it corrupts the disks on the location where you did not run the shrink script. This can be solved by removing the META file on the location where the shrink operation was not performed, which causes a full file copy from the other location. When you have a lot of disks, that is a significant data transfer. Instead of shrinking the disks we decided to rely on storage deduplication, which is part of the Dell PowerStore storage solution we use. This gives us nearly the same storage gains as the shrink script.

@antonvdl
Author

@vikrant003 Exactly the same here.
Not using Dell PowerStore but a different vendor who also offers deduplication.

@vikrant003

@StevenM79 and @antonvdl ..thank you for your feedback..

@mav147

mav147 commented Oct 11, 2023

We believe this is the source of the problems we're seeing too: corruption in FSLogix. Due to our setup (aiming for high availability), users can switch between primary and secondary sites each time they log into AVD. Does anyone know if the newer "shrink on logoff" feature in FSLogix causes the same issues as the shrink script?

@lordjeb
Member

lordjeb commented Oct 11, 2023

@mav147 The new feature in FSLogix should not have the same issues. It works within the FSLogix agent with the disks mounted by CCD. So, as it makes optimization changes, these updates are written out to all providers, and the .meta files are updated so that they can be kept in sync.

@mav147

mav147 commented Oct 12, 2023

@mav147 The new feature in FSLogix should not have the same issues. It works within the FSLogix agent with the disks mounted by CCD. So, as it makes optimization changes, these updates are written out to all providers, and the .meta files are updated so that they can be kept in sync.

That makes sense, thanks. We'll give it a try :)

@msft-jasonparker

@mav147 The new feature in FSLogix should not have the same issues. It works within the FSLogix agent with the disks mounted by CCD. So, as it makes optimization changes, these updates are written out to all providers, and the .meta files are updated so that they can be kept in sync.

That makes sense, thanks. We'll give it a try :)

Keep in mind that your sign-out times will be significantly increased. In order to compact the VHD, we bring the entire contents local from the storage provider. We then evaluate the VHD to determine if we can compact / save space. If we are able to compact and save space, we perform the operation and then the VHD must be uploaded to ALL storage providers.

All of these actions are part of the user sign-out operation.
