Azure Backup job is failing to unfreeze Ubuntu OS after running safefreeze #1868

Open
Mi1an opened this issue Jan 15, 2024 · 2 comments

@Mi1an

Mi1an commented Jan 15, 2024

Hello,

This is the second time (of three in total) that we have encountered an error with fsfreeze during Azure Backup. Our system is frozen during the Azure Backup but is never unfrozen. The whole server is stuck, and the only thing we can do is deallocate the Azure VM and start it again.

When we checked the logs, we saw that the last entries written to extension.log are the invocation of safefreeze and the message "proceeded for accepting signals". Raw log:
2024-01-14 22:16:48.792020 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]PreSnapshot: Status Code: 200
2024-01-14 22:16:48.794220 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]Taking Snapshot through Host
2024-01-14 22:16:48.796446 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]T:S freeze, timeout value 60
2024-01-14 22:16:48.798717 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]skipped mount :
2024-01-14 22:16:48.800908 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]fsfreeze mount :/mnt
2024-01-14 22:16:48.803464 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]fsfreeze mount :/
2024-01-14 22:16:48.805675 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]skip freeze is : False
2024-01-14 22:16:48.811102 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]arg : ['/var/lib/waagent/Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9207.0/main/safefreeze/bin/safefreeze', '60', '/']
2024-01-14 22:16:48.813571 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]proceeded for accepting signals
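For anyone comparing their own extension.log against this one, here is a minimal sketch that pulls the safefreeze invocation out of the `arg :` line. It relies only on the fact that the list in the log is printed as a Python literal. The helper name `parse_safefreeze_args` is hypothetical and not part of the extension, and reading the second element as the timeout is an assumption based on the "timeout value 60" line just above it.

```python
import ast
import re

# Hypothetical helper, not part of the VMSnapshotLinux extension.
# Matches the bracketed list that follows "arg : " in the log line.
ARG_LINE = re.compile(r"arg : (\[.*\])\s*$")


def parse_safefreeze_args(log_line):
    """Return (binary, timeout, mount_points) from an 'arg :' line, or None."""
    match = ARG_LINE.search(log_line)
    if match is None:
        return None
    # The list is logged as a Python literal, so literal_eval can read it safely.
    argv = ast.literal_eval(match.group(1))
    # Assumption: argv[1] is the timeout ("timeout value 60" in the log),
    # and everything after it is a mount point.
    binary, timeout, *mounts = argv
    return binary, int(timeout), mounts


line = (
    "2024-01-14 22:16:48.811102 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0]"
    "arg : ['/var/lib/waagent/Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9207.0"
    "/main/safefreeze/bin/safefreeze', '60', '/']"
)
binary, timeout, mounts = parse_safefreeze_args(line)
print(timeout, mounts)  # 60 ['/']
```

Note that the log lists both /mnt and / under "fsfreeze mount", while only / appears in the arguments passed to safefreeze.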

In the Azure Backup job report, the job is shown as failed with these details:
ExtensionOperationInProgress
Command execution failed.
Another operation is in progress on this item. Please wait until the previous operation is completed.

We configured everything as this documentation describes:
https://learn.microsoft.com/en-us/azure/backup/backup-azure-linux-app-consistent

We are unable to dig any deeper into the root cause. We assume it's a bug because the system is never unfrozen as it should be. We can't influence this behavior because the freeze is initiated by the Azure Backup agent job.

Our OS:
Ubuntu 18.04.6 LTS

VM agent version:
2.7.3.0

@katatohuk

We have the same issue and have been in conversation with the MS Azure Backup team for a few months already... nothing so far. All we know is that it's fsfreeze which causes the issue, and the VM goes bananas... the main symptom is that the load average just goes crazy.

@tillc24

tillc24 commented Sep 30, 2024

We are experiencing the same condition on RHEL 8.10+ in MAG. The agent freezes the file systems but never thaws them. This has occurred in our environment at least 10 times since July 2024. Based on our review, something fails between the freeze and the thaw, and the error handling doesn't trap it. With the /var/log file system frozen, nothing is written to any logs to identify exactly what the problem is, and Microsoft support couldn't provide any input. Our only recourse is to deallocate and reallocate the VM. Redeploying does seem to reduce the occurrence, but that's just an observation.

We initially suspected this might be due to a swap file being placed on a temporary disk; however, the condition recurred after we excluded the /mnt file system (https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux/create-swap-file-linux-vm#create-a-swap-partition - option #1).

Lastly, the VMs on which we've encountered this problem (at least 5 unique) ran for months before this condition occurred for the first time.

Sharing this in case it helps anyone. The last thing we see is "accepting signals".

2024/09/29 01:14:39.919079 Info PreSnapshot: Status Code: 200
2024/09/29 01:14:39.920540 Info Taking Snapshot through Host
2024/09/29 01:14:39.921552 Info T:S freeze, timeout value 60
2024/09/29 01:14:39.922781 Info skipped mount :
2024/09/29 01:14:39.923908 Info fsfreeze mount :/mnt
2024/09/29 01:14:39.926383 Info fsfreeze mount :/var/log/audit
2024/09/29 01:14:39.927263 Info fsfreeze mount :/var/log
2024/09/29 01:14:39.928283 Info fsfreeze mount :/var/tmp
2024/09/29 01:14:39.928938 Info fsfreeze mount :/var
2024/09/29 01:14:39.931145 Info fsfreeze mount :/backup
2024/09/29 01:14:39.932091 Info fsfreeze mount :/home
2024/09/29 01:14:39.933397 Info fsfreeze mount :/boot
2024/09/29 01:14:39.935240 Info fsfreeze mount :/tmp
2024/09/29 01:14:39.936067 Info fsfreeze mount :/usr
2024/09/29 01:14:39.936593 Info fsfreeze mount :/
2024/09/29 01:14:39.937537 Info skip freeze is : False
2024/09/29 01:14:39.938418 Info arg : ['/var/lib/waagent/Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9220.0/main/safefreeze/bin/safefreeze', '60', '/var/log/audit', '/var/log', '/var/tmp', '/var', '/backup', '/home', '/boot', '/tmp', '/usr', '/']
2024/09/29 01:14:39.939423 Info proceeded for accepting signals
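A rough sketch for checking an excerpt like the one above mechanically: it collects the mount points logged under "fsfreeze mount", compares them with the arguments actually passed to safefreeze, and reports whether the log stops at "proceeded for accepting signals". The function name `summarize_freeze` and the returned keys are hypothetical. The message prefixes are copied from the log lines in this thread, and treating everything after the second argument as a mount point is an assumption.

```python
import ast

# Hypothetical helper for reading an extension log excerpt. The message
# prefixes are copied from the log lines quoted in this thread.
FREEZE_PREFIX = "fsfreeze mount :"
ARG_PREFIX = "arg : "
LAST_SEEN = "proceeded for accepting signals"


def summarize_freeze(log_lines):
    logged, passed, last_message = [], [], None
    for raw in log_lines:
        line = raw.strip()
        if not line:
            continue
        if FREEZE_PREFIX in line:
            logged.append(line.split(FREEZE_PREFIX, 1)[1].strip())
        elif ARG_PREFIX in line:
            # The argument list is logged as a Python literal.
            argv = ast.literal_eval(line.split(ARG_PREFIX, 1)[1])
            # Assumption: argv[0] is the binary, argv[1] the timeout,
            # and the rest are mount points.
            passed = argv[2:]
        last_message = line
    return {
        "logged_not_passed": [m for m in logged if m not in passed],
        "ends_at_accepting_signals": (
            last_message is not None and last_message.endswith(LAST_SEEN)
        ),
    }


excerpt = """\
2024/09/29 01:14:39.923908 Info fsfreeze mount :/mnt
2024/09/29 01:14:39.926383 Info fsfreeze mount :/var/log/audit
2024/09/29 01:14:39.927263 Info fsfreeze mount :/var/log
2024/09/29 01:14:39.928283 Info fsfreeze mount :/var/tmp
2024/09/29 01:14:39.928938 Info fsfreeze mount :/var
2024/09/29 01:14:39.931145 Info fsfreeze mount :/backup
2024/09/29 01:14:39.932091 Info fsfreeze mount :/home
2024/09/29 01:14:39.933397 Info fsfreeze mount :/boot
2024/09/29 01:14:39.935240 Info fsfreeze mount :/tmp
2024/09/29 01:14:39.936067 Info fsfreeze mount :/usr
2024/09/29 01:14:39.936593 Info fsfreeze mount :/
2024/09/29 01:14:39.937537 Info skip freeze is : False
2024/09/29 01:14:39.938418 Info arg : ['/var/lib/waagent/Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9220.0/main/safefreeze/bin/safefreeze', '60', '/var/log/audit', '/var/log', '/var/tmp', '/var', '/backup', '/home', '/boot', '/tmp', '/usr', '/']
2024/09/29 01:14:39.939423 Info proceeded for accepting signals
"""
summary = summarize_freeze(excerpt.splitlines())
print(summary)
```

On this excerpt the only mount point that is logged but not passed to safefreeze is /mnt, and the log does end at "proceeded for accepting signals". This thread does not establish what that difference means.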
