
Commit c316c1c

docs: ovh3 better backups (#436)
Using sanoid/syncoid for OVH backups.
1 parent fc05d2a commit c316c1c

File tree

3 files changed (+164 / -17 lines)


docs/logs-ovh3.md

Lines changed: 19 additions & 0 deletions
@@ -3,6 +3,25 @@
Report here the timeline of incidents and interventions on ovh3 server.
Keep things short or write a report.

## 2024-10-31 system taking 100% CPU

* Server is not accessible via SSH.
* Munin shows 100% CPU used by the system.
* We ask for a hard reboot on the OVH console.
* After the restart, the system continues to use 100% CPU.
* Top shows that arc_prune + arc_evict are using 100% CPU.
* Exploring the logs does not reveal any strange messages.
* `cat /proc/spl/kstat/zfs/arcstats|grep arc_meta` shows arc_meta_used < arc_meta_max (so it's OK).
* We soft reboot the server.
* It is back to normal.

## 2024-10-10

sda on ovh3 is faulty (64 Current_Pending_Sector, 2 Reallocated_Event_Count).
See https://github.com/openfoodfacts/openfoodfacts-infrastructure/issues/424

## 2023-12-05 certificates for images expired

Images not displaying anymore on the website due to SSL problem (signaled by Edouard, with alert by blackbox exporter)

docs/proxmox.md

Lines changed: 29 additions & 17 deletions
@@ -6,8 +6,31 @@ On ovh1 and ovh2 we use proxmox to manage VMs.

## Proxmox Backups

-Every VM / CT is backuped twice a week using general proxmox backup, in a specific zfs dataset
-(see Datacenter -> backup)
+**IMPORTANT:** We don't use the standard proxmox backup[^previous_backups] (see Datacenter -> backup).
+
+Instead we use [syncoid / sanoid](./sanoid.md) to snapshot and synchronize data to other servers.
+
+[^previous_backups]: Previously, every VM / CT was backed up twice a week using the general proxmox backup, in a specific zfs dataset.
+
+## Storage synchronization
+
+We don't use the standard proxmox storage replication because it is incompatible with [syncoid / sanoid](./sanoid.md): it removes snapshots on the destination and does not allow choosing the destination location.
+
+This means that restoring a container / VM won't be automatic and will need a manual intervention.
+
+### Replication (don't use it)
+
+Previously, VM and container storage were regularly synchronized to ovh3 (and eventually to ovh1/2).
+
+Replication can be seen in the web interface, by clicking on the "Replication" section of a particular container / VM.
+
+This is managed with the command line `pvesr` (PVE Storage Replication). See the [official doc](https://pve.proxmox.com/wiki/Storage_Replication).
+
+* To add a replication on a container / VM:
+  * In the Replication menu of the container, "Add" one
+  * Target: the server you want
+  * Schedule: */5 if you want it every 5 minutes (takes less than 10 seconds, thanks to ZFS)


## Host network configuration

@@ -117,22 +140,16 @@ At OVH we have special DNS entries:
* `proxy1.openfoodfacts.org` pointing to OVH reverse proxy
* `off-proxy.openfoodfacts.org` pointing to Free reverse proxy

-## Storage synchronization
-
-VM and container storage are regularly synchronized to ovh3 (and eventually to ovh1/2) to have a continuous backup.
-
-Replication can be seen in the web interface, clicking on "replication" section on a particular container / VM.
-
-This is managed with command line `pvesr` (PVE Storage replication). See [official doc](https://pve.proxmox.com/wiki/Storage_Replication)
-

## How to migrate a container / VM

You may want to move containers or VM from one server to another.

-Just go to the interface, right click on the VM / Container and ask to migrate !
+**FIXME** this will not work with sanoid/syncoid.
+
+~~Just go to the interface, right click on the VM / Container and ask to migrate!~~

-If you have a large disk, you may want to first setup replication of your disk to the target server (see [Storage synchronization](#storage-synchronization)), schedule it immediatly (schedule button)− and then run the migration.
+~~If you have a large disk, you may want to first set up replication of your disk to the target server (see [Storage synchronization](#storage-synchronization)), schedule it immediately (schedule button), and then run the migration.~~

## How to Unlock a Container

@@ -254,11 +271,6 @@ Using the web interface:
* Start at boot: Yes
* Protection: Yes (to avoid deleting it by mistake)

-* Eventually Add replication to ovh3 or off1/2 (if we are not using sanoid/syncoid instead)
-  * In the Replication menu of the container, "Add" one
-  * Target: ovh3
-  * Schedule: */5 if you want every 5 minutes (takes less than 10 seconds, thanks to ZFS)

Also think about [configuring email](./mail.md#postfix-configuration) in the container

## Logging in to a container or VM

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# 2024-10-30 OVH3 backups

We need an intervention to change a disk on ovh3.

We still have very few backups for OVH services.

Before the operation, I want to at least have replication of OVH backups on the new MOJI server.

We previously tried to do it while keeping replication, but it did not work well
(see [2024-09-30 ovh3 backups](./2024-09-30-ovh3-backups.md)).

So here is what we are going to do:
* remove replication and let sanoid / syncoid deal with replication to ovh3
* we will have fewer snapshots on the ovh1/ovh2 side, and we will use a replication snapshot
  to avoid relying on a common existing snapshot made by syncoid

Note that we don't replicate between ovh1 and ovh2 because we have very little space left on the disks.

## Changing sanoid / syncoid config and removing replication

First, because we won't use replication anymore, we have to create the ovh3operator on ovh1 and ovh2,
and as we want to use a replication snapshot, we have to grant the corresponding ZFS rights.
See [Sanoid / creating operator on PROD_SERVER](../sanoid.md#creating-operator-on-prod_server).
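
For reference, a minimal sketch of what creating the operator and delegating the ZFS rights can look like. The user name comes from this report, but the `useradd` options and the permission list below are assumptions; the authoritative procedure is in [sanoid.md](../sanoid.md).

```bash
# run as root on ovh1 and ovh2 (sketch only, see sanoid.md for the real procedure)
useradd --create-home --shell /bin/bash ovh3operator

# delegate the ZFS permissions syncoid needs to send data and manage its replication snapshot
# (assumed permission list)
zfs allow ovh3operator send,hold,snapshot,destroy,mount rpool
```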

I also had to link the zfs command into /usr/bin: `ln -s /usr/sbin/zfs /usr/bin`

For each VM / CT separately I did:
* disable replication
* I didn't have to change the syncoid policy on ovh1/2
* on ovh3:
  * configure the sanoid policy from replicated_volumes to a regular synced_data one
  * use syncoid to sync the volume and add the line to syncoid-args (using a specific snapshot); see the sketches below
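
As a sketch, the sanoid policy change on ovh3 could look like the following. The `synced_data` template name and the CT 130 dataset come from this report; the retention values and flags are assumptions (the real ones live in our sanoid configuration).

```ini
# /etc/sanoid/sanoid.conf on ovh3 (sketch, assumed values)
[template_synced_data]
# snapshots are created on ovh1/ovh2 and received here, so no autosnap
autosnap = no
# prune old received snapshots locally
autoprune = yes
hourly = 24
daily = 30

[rpool/subvol-130-disk-0]
use_template = synced_data
```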

I tried with CT 130 (contents) first. I ran the syncoid command manually:
```bash
syncoid --no-privilege-elevation [email protected]:rpool/subvol-130-disk-0 rpool/subvol-130-disk-0
```
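
The line added to syncoid-args would then presumably mirror that manual command; assuming the file holds one set of syncoid arguments per volume and per line (an assumption about its format), the CT 130 entry might be:

```bash
# sketch of a syncoid-args entry for CT 130 (file format assumed)
--no-privilege-elevation [email protected]:rpool/subvol-130-disk-0 rpool/subvol-130-disk-0
```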

I did some less important CTs and VMs (113, 140, 200) and decided to wait until the next day to check that everything was OK.

The day after, I did the same for CT 101, 102, 103, 104, 105, 106, 108, 202 and 203 on ovh1,
and 107, 110 and 201 on ovh2.

I removed 109 and 120.

I also removed the sync of CT 107 to ovh1 (because we are nearly out of disk space) and removed the volume there.

## Syncing between ovh1 and ovh2

We have two VMs that are really important to replicate (still using syncoid) from ovh1 to ovh2.

So I [created an ovh2operator on ovh1](../sanoid.md#creating-operator-on-prod_server).

I installed the syncoid systemd service and enabled it.
Same for `sanoid_check`.
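
Enabling them is standard systemd handling; a sketch, assuming the unit names match the service names mentioned in this report:

```bash
# enable and start the syncoid and sanoid_check units (unit names assumed)
systemctl enable --now syncoid.timer sanoid_check.timer
# check the last runs
systemctl status syncoid.service sanoid_check.service
```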

I added the synced_data template to sanoid.conf to use it for the synced volumes.
I removed volumes 101 and 102 and manually synced them from ovh1 to ovh2.
Then I added them to `syncoid-args.conf`.
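
The manual sync would be analogous to the CT 130 example above, run on ovh2 and pulling from ovh1 as ovh2operator. A sketch only: the ovh1 address and the exact dataset name are assumptions.

```bash
# run on ovh2: pull the VM 101 disk from ovh1 (address and dataset name assumed)
syncoid --no-privilege-elevation [email protected]:rpool/vm-101-disk-0 rpool/vm-101-disk-0
```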

## Removing dump backups on ovh1/2

We also decided to remove the dump backups on ovh1/2/3.

Going to the Proxmox interface, Datacenter, Backup, I disabled the backups.

## Removing replication snapshots

On ovh1, ovh2 and ovh3 I removed the __replicate_ snapshots:

```bash
zfs list -r -t snap -o name rpool|grep __replicate_
zfs list -r -t snap -o name rpool|grep __replicate_|xargs -n 1 zfs destroy
```

Also on osm45:
```bash
zfs list -r -t snap -o name hdd-zfs/off-backups/ovh3-rpool|grep __replicate_
zfs list -r -t snap -o name hdd-zfs/off-backups/ovh3-rpool|grep __replicate_|xargs -n 1 zfs destroy
```

I did the same for the vz_dump snapshots, as the dump backups are no longer active.
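
The cleanup commands would be analogous to the __replicate_ ones, assuming the dump snapshots carry the usual Proxmox vzdump prefix in their names:

```bash
zfs list -r -t snap -o name rpool|grep vzdump
zfs list -r -t snap -o name rpool|grep vzdump|xargs -n 1 zfs destroy
```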

## Checking syncs on osm45 (Moji)

We don't need to use a dedicated sync snapshot on Moji anymore, so we changed the sanoid.conf
to use the `--no-sync-snap` option for every volume but backups (which is not handled by sanoid on the ovh3 side).
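
For a given volume, the pull from ovh3 would then look something like this. A sketch only: the operator name, the ovh3 host alias and the exact dataset are assumptions.

```bash
# on osm45: reuse sanoid's snapshots on ovh3 instead of creating a syncoid sync snapshot
syncoid --no-sync-snap --no-privilege-elevation osm45operator@ovh3:rpool/subvol-130-disk-0 hdd-zfs/off-backups/ovh3-rpool/subvol-130-disk-0
```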

Syncs seem OK.

One day later, we cleaned the old remaining syncoid snapshots on osm45:
```bash
# verify that we have snapshots after the syncoid one
zfs list hdd-zfs/off-backups/ovh3-rpool -t snap -r -o name -H|grep -A 3 @syncoid_osm45|grep -v ovh3-rpool/backups@
# clean
zfs list hdd-zfs/off-backups/ovh3-rpool -t snap -r -o name -H|grep @syncoid_osm45|grep -v ovh3-rpool/backups@|xargs -n 1 -r zfs destroy
```

And on ovh3:
```bash
# verify
zfs list rpool -t snap -r -o name -H|grep @syncoid_osm45|grep -v backups@
# destroy
zfs list rpool -t snap -r -o name -H|grep @syncoid_osm45|grep -v backups@|xargs -n 1 -r zfs destroy
```

## Related commits

Commits of configuration changes on ovh1, ovh2, ovh3:

* [feat: some more hourly snapshots on ovh1](https://github.com/openfoodfacts/openfoodfacts-infrastructure/commit/fd68c17ee2e929703ec364cbffae2d9bf7861d15)
* [feat: using syncoid to sync data from ovh1/2](https://github.com/openfoodfacts/openfoodfacts-infrastructure/commit/2a4a413e38827e30a844f85c3e7416fdcfd998a1)
* [feat(ovh1): sanoid install](https://github.com/openfoodfacts/openfoodfacts-infrastructure/commit/9d915e0e02afbcd0ce4addd30fb7c9b9d35d5a41)
* [feat: some more hourly snapshots on ovh2](https://github.com/openfoodfacts/openfoodfacts-infrastructure/commit/91d89ef5cc900776b4498a4193aee6dc4a5af075)
* [feat: sync some ovh1 volumes](https://github.com/openfoodfacts/openfoodfacts-infrastructure/commit/47ecab46bcb1b11188d4fecf732ae6adec37054a)
* [fix: do not use a sync snap to sync from ovh3 anymore](https://github.com/openfoodfacts/openfoodfacts-infrastructure/commit/8426727bef1ef55a2c5b233233c484b4177fdcc9)
