
How to confirm that everything is good after OSD reweight to 0.0 and purge afterwards #53

Open
Badb0yBadb0y opened this issue Sep 9, 2024 · 3 comments

Comments

@Badb0yBadb0y

Badb0yBadb0y commented Sep 9, 2024

Hi,

Thank you for the effort of creating this tool; hopefully it will help me with my issue of somehow lowering the load on my osds during OSD add/remove.

Currently I'm testing the
pgremapper cancel-backfill --yes
option as written in the documentation.

I have 2 questions:

Q1 regarding osd removal:
What I've done in test env:

  1. Set the flags: ceph osd set nobackfill;ceph osd set norebalance
  2. Reweight all the osds that I want to remove: for i in {36..43};do ceph osd reweight $i 0.0;done
  3. Run your tool and wait couple of minutes to get back the prompt: pgremapper cancel-backfill --yes
  4. Unset flags: ceph osd unset norebalance;ceph osd unset nobackfill
  5. When the cluster health is ok, I removed the osds: for num in {36..43}; do ceph osd out osd.$num;systemctl disable ceph-osd@$num;ceph osd purge $num --yes-i-really-mean-it;umount /var/lib/ceph/osd/ceph-$num;done

In theory I'm now missing a lot of chunks, so how do I know that all the data has actually been recovered/regenerated somewhere else in the cluster?
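
Would checking something like the following be enough, for example?

ceph -s                  # overall health plus degraded/misplaced object counts
ceph osd df tree         # PGS column for the osds being removed
ceph pg ls-by-osd 36     # any pgs still mapped to one of the removed osds?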

Q2 regarding the process:
I see in the documentation that this needs to be done before unsetting the flags. Is it possible to do it if recovery is already in progress? Let's say I'm experiencing issues during recovery, so I want to limit it somehow.
Also this question vice versa: if the remapping is too slow, is it possible to cancel it and speed things back up to how they were originally? Let's say I'd increase osd_max_backfills and osd_recovery_max_active on all osds with the injectargs command.

@jbaergen-do
Contributor

Reweight all the osds that I want to remove: for i in {36..43};do ceph osd reweight $i 0.0;done

Note that due to Ceph limitations, reweighting to 0 will prevent pgremapper from being able to do anything. We usually reweight to 0.0001 or similar.
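
For example, adapting your loop:

for i in {36..43}; do ceph osd reweight $i 0.0001; done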

Q1 regarding osd removal:

You're missing a pile of steps here, and this has more to do with Ceph knowledge than pgremapper itself. When you reweighted your OSDs, Ceph scheduled backfill to move data off of the OSDs. Normally, you would wait for that backfill to complete, verify that the OSDs that you want to remove are empty of PGs via ceph osd df tree or similar, and only then start to tear those OSDs down. As an extra safety step, I like to take the OSD down (systemctl stop ceph-osd@$num) and then make sure via ceph -s that all PGs are still healthy. I would also use ceph osd destroy $num before purge, and that should by default have some fairly conservative safety checks.
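
As a rough per-OSD sketch of that sequence (adjust ids and paths to your environment):

ceph osd df tree                               # wait until PGS shows 0 for the OSDs being removed
systemctl stop ceph-osd@$num                   # take the OSD down
ceph -s                                        # confirm all PGs are still active+clean
ceph osd destroy $num --yes-i-really-mean-it   # mark the OSD destroyed
ceph osd purge $num --yes-i-really-mean-it     # remove it from the CRUSH map entirely
umount /var/lib/ceph/osd/ceph-$num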

cancel-backfill will stop this process from working, by itself, because you're stopping the backfill that's needed to empty these OSDs. You would then need to use undo-upmaps --target against those OSDs that you want to remove until they're empty.

Is it possible to do it if recovery is already in progress?

Yes, it can be run at any time. Setting the flags just allows you to remain in control the whole time, reducing any wasted work that you might want to actually cancel using pgremapper. It is possible that pgremapper can occasionally fail if backfill finishes or gets scheduled while the tool runs, though.

if the remapping is too slow, is it possible to cancel it and speed things back up to how they were originally?

I'm not sure what you mean by this - can you elaborate?

Let's say I'd increase osd_max_backfills and osd_recovery_max_active on all osds with the injectargs command

injectargs should almost never be used as of Nautilus - Ceph's centralized conf is almost always what you want to use instead.
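
For example:

ceph config set osd.36 osd_max_backfills 4    # persisted in the mon config DB
ceph config show osd.36 osd_max_backfills     # check what the daemon is actually running with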

@Badb0yBadb0y
Author

Badb0yBadb0y commented Sep 10, 2024

Reweight all the osds that I want to remove: for i in {36..43};do ceph osd reweight $i 0.0;done

Note that due to Ceph limitations, reweighting to 0 will prevent pgremapper from being able to do anything. We usually reweight to 0.0001 or similar.

I see, so what would be the proper steps if I want to remove osds (or even a complete host) using the cancel-backfill pgremapper feature, to prevent overloading osds on the same node or on other nodes with the same device class?

Let's say I have this node

-61          118.75250         -  119 TiB    64 TiB    61 TiB  285 GiB   2.8 TiB   55 TiB  53.73  0.98    -              host osd-2s15
558   nvme     1.74660   0.50000  1.7 TiB    74 GiB   250 MiB   73 GiB   899 MiB  1.7 TiB   4.13  0.08   89      up          osd.558
559   nvme     1.74660   0.50000  1.7 TiB    72 GiB   182 MiB   71 GiB   1.1 GiB  1.7 TiB   4.03  0.07   87      up          osd.559
560   nvme     1.74660   0.50000  1.7 TiB    71 GiB   182 MiB   70 GiB   826 MiB  1.7 TiB   3.98  0.07   88      up          osd.560
561   nvme     1.74660   0.50000  1.7 TiB    72 GiB   182 MiB   71 GiB   776 MiB  1.7 TiB   4.01  0.07   88      up          osd.561

I want to remove these osds and then add them back later (they currently have 2 osds per nvme and I want to add them back as 1 osd per nvme).

  1. Set the nobackfill and norebalance flags
  2. Reweight to 0.0001
  3. Run cancel-backfill - or should I run this instead? pgremapper undo-upmaps bucket:testnode-2s04 --max-backfill-reservations 1 --verbose --yes - the issue with the 0.0001 approach is that some pgs always remain on the osds. Or should I simply mark the osds out and go with the cancel-backfill approach?
  4. Unset the flags
  5. Wait until the pg balancer finds no more optimizations, ceph osd df tree shows 0 pgs on those osds, and all pgs are healthy
  6. Stop and disable the osds
  7. Destroy first
  8. Then purge

Moving forward, when I add the osds back, the steps would be as written earlier in my first post, am I correct?

  1. Set the nobackfill and norebalance flags
  2. Run cancel-backfill
  3. Unset the flags
  4. Wait until the pg balancer finds no more optimizations and all pgs are healthy
  5. Done (rough command sketch below)
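
Roughly like this, I imagine:

ceph osd set nobackfill; ceph osd set norebalance
# (create/add the new osds here)
pgremapper cancel-backfill --yes
ceph osd unset norebalance; ceph osd unset nobackfill
# then wait for the balancer to move data onto the new osds gradually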

Q1 regarding osd removal:

You're missing a pile of steps here, and this has more to do with Ceph knowledge than pgremapper itself. When you reweighted your OSDs, Ceph scheduled backfill to move data off of the OSDs. Normally, you would wait for that backfill to complete, verify that the OSDs that you want to remove are empty of PGs via ceph osd df tree or similar, and only then start to tear those OSDs down. As an extra safety step, I like to take the OSD down (systemctl stop ceph-osd@$num) and then make sure via ceph -s that all PGs are still healthy. I would also use ceph osd destroy $num before purge, and that should by default have some fairly conservative safety checks.

cancel-backfill will stop this process from working, by itself, because you're stopping the backfill that's needed to empty these OSDs. You would then need to use undo-upmaps --target against those OSDs that you want to remove until they're empty.

May I have an example of undo-upmaps that reflects my use case? I saw this, but it's not really clear to me, to be honest, whether my case differs from this (regarding host removal):
pgremapper undo-upmaps bucket:testnode-2s04 --max-backfill-reservations 16 --verbose --yes

Is it possible to do it if recovery is already in progress?

Yes, it can be run at any time. Setting the flags just allows you to remain in control the whole time, reducing any wasted work that you might want to actually cancel using pgremapper. It is possible that pgremapper can occasionally fail if backfill finishes or gets scheduled while the tool runs, though.

If this failure happens, do I just need to rerun the pgremapper command?

if the remapping is too slow, is it possible to cancel it and speed things back up to how they were originally?

I'm not sure what you mean by this - can you elaborate?

Sure, so let's say something goes wrong with the pgremapper tool - what would be the fallback steps? Or let's say I want to speed up the process, which at this stage is handled by the balancer optimizing the pg allocation. Are these possible?

Let's say I'd increase osd_max_backfills and osd_recovery_max_active on all osds with the injectargs command

injectargs should almost never be used as of Nautilus - Ceph's centralized conf is almost always what you want to use instead.

We are running Octopus (not cephadm). I want to control the backfill slowly, so first I'd increase the backfill and recovery op limits on the newly added osds, let's say to 4, while all the rest stay at 1. When the process starts to slow down I'd increase the rest of them to 2, or increase the newly added ones even further. I think with the config db it's more work to control this osd by osd.
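
For example, something like this with injectargs (osd ids taken from my test node above):

for i in 558 559 560 561; do
  ceph tell osd.$i injectargs '--osd_max_backfills=4 --osd_recovery_max_active=4'
done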

@jbaergen-do
Contributor

OK, if you use the balancer in those steps then you should be OK. undo-upmaps can be used to have more advanced control over data movement, but keeping it simple with the balancer would probably be best. The balancer can still overload OSDs in some circumstances, and you can control how much work it schedules via target_max_misplaced_ratio.
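
For example, to keep the balancer from scheduling more than ~1% of PGs as misplaced at a time:

ceph config set mgr target_max_misplaced_ratio 0.01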

or should I run this instead? pgremapper undo-upmaps ...

undo-upmaps is an alternative to relying on the balancer, not an alternative to cancel-backfill.

the issue with the 0.0001 approach is that some pgs always remain on the osds. Or should I simply mark the osds out and go with the cancel-backfill approach?

Setting an OSD out is equivalent to setting its weight to 0; that can be done at the very end, when there is little backfill remaining.

Moving forward, when I add the osds back, the steps would be as written earlier in my first post, am I correct?

Yes

May I have an example of undo-upmaps that reflects my use case?

I'll try to keep this high-level; this tooling is very much advanced user territory and while I don't know of cases where it can do dangerous things, without understanding what it's doing at a deep level it could make your life harder, not easier.

After you reweight OSDs to 0.0001, Ceph will reassign the PGs associated with those OSDs to other OSDs. Running cancel-backfill will create upmap table entries that undo that reassignment. undo-upmaps can then be used to remove those upmap table entries in a controlled fashion, allowing you to control how much backfill gets scheduled at once more precisely than you can with the balancer.

So, for example:
pgremapper undo-upmaps bucket:osd-2s15 --target --max-backfill-reservations 1 --max-source-backfills 1 --verbose --yes

Running that in a loop should cause one backfill to be scheduled per source OSD and target OSD at a time until all upmaps targeting OSDs in the host osd-2s15 are undone.
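
As a rough sketch of such a loop (the sleep interval is arbitrary; check progress with ceph osd df tree and stop once the OSDs are empty):

while true; do
  pgremapper undo-upmaps bucket:osd-2s15 --target \
    --max-backfill-reservations 1 --max-source-backfills 1 --verbose --yes
  sleep 300
done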

If this failure happens, do I just need to rerun the pgremapper command?

Yeah

Sure, so let's say something goes wrong with the pgremapper tool - what would be the fallback steps? Or let's say I want to speed up the process, which at this stage is handled by the balancer optimizing the pg allocation. Are these possible?

This is out of scope of pgremapper. There are standard ways of affecting the speed that backfill operates at via balancer and backfill settings.

When the process starts to slow down I'd increase the rest of them to 2, or increase the newly added ones even further. I think with the config db it's more work to control this osd by osd.

With the config DB, you can set these settings at a host level (though unfortunately that might be broken in Octopus, not sure). The issue with injectargs is that the config DB can no longer control those settings in the OSDs; anything set by injectargs will remain until the OSD reboots and takes precedence over config DB settings. You can remove these overrides that you have injected so that the config values take effect again, you just need to remember to do this.
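
For example (the host-level mask may or may not work on Octopus, as mentioned):

ceph config set osd/host:osd-2s15 osd_max_backfills 2   # host-level value in the config DB
ceph config show osd.558 osd_max_backfills              # verify which value the daemon is actually using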
