Python and Shell scripts packaged as Docker images to clean up disks after an ODF (OpenShift Data Foundation) uninstall.
An OpenShift Data Foundation (ODF) installation can fail with one or more nodes in the cluster never entering the `Ready` state, while the associated Ceph OSD and MON pods get stuck in `CrashLoopBackOff`. The suspected root cause is improper or incomplete cleanup of the node's local disks, which breaks the Ceph OSD/MON pods and keeps the node from becoming ready.

The following steps clean up the node and re-add it to the cluster:
- Remove the node from the Local Storage Operator:
  - Edit the `LocalVolumeDiscovery` and `LocalVolumeSet` CRs to remove the affected node (a hypothetical patch sketch follows)
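  For example, a minimal sketch using `oc patch`. The CR name (`local-block`), the namespace, and the JSON-patch array index are assumptions; inspect your CRs first and adjust the path to their actual layout (repeat for the `LocalVolumeDiscovery` CR):

  ```sh
  # Inspect the CRs to find where the node's hostname is listed
  oc -n openshift-local-storage get localvolumeset,localvolumediscovery -o yaml

  # Drop the node's hostname from the LocalVolumeSet node selector
  # (assumes it is the second entry, index 1, of the values array)
  oc -n openshift-local-storage patch localvolumeset local-block --type json \
    -p '[{"op":"remove","path":"/spec/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values/1"}]'
  ```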
- Cordon and drain the node:

  ```sh
  oc adm cordon <node-name>
  oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
  ```
- Remove `PersistentVolume` objects related to the node that back storage in the `openshift-storage` namespace:

  ```sh
  oc get pv | grep <node-name>
  oc delete pv <pv-name>
  ```
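  To remove several PVs at once, a hypothetical convenience loop (assumes the node's hostname appears in the `oc get pv` listing):

  ```sh
  for pv in $(oc get pv --no-headers | grep "<node-name>" | awk '{print $1}'); do
    oc delete pv "$pv"
  done
  ```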
- Perform the manual disk cleanup on the node by running the scripts:
  - Authenticate against the OpenShift cluster
  - Start a debug pod:

    ```sh
    oc debug node/<node>
    ```

  - Change root to the host to access all of its binaries and files:

    ```sh
    chroot /host
    ```

  - Run the script as a container:

    ```sh
    podman run ghcr.io/stakater/odf-disk-cleaner:vX.Y.Z --disks "/path/to/disk1 /path/to/disk2 ..."
    ```
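  For context, a minimal sketch of the kind of per-disk cleanup such an image typically performs. This illustrates standard Ceph disk-teardown steps and is not the published script; device paths are placeholders:

  ```sh
  #!/bin/sh
  # Hypothetical cleanup sketch - pass disks as arguments, e.g.
  #   ./cleanup.sh /dev/sdb /dev/sdc
  set -eu

  # Tear down leftover ceph-volume device-mapper entries first
  dmsetup ls | awk '/ceph--/ {print $1}' | xargs -r -n1 dmsetup remove

  for disk in "$@"; do
    wipefs --all "$disk"     # drop filesystem/LVM signatures
    sgdisk --zap-all "$disk" # destroy GPT and MBR partition tables
    # zero the first 100 MiB to clear BlueStore labels and other metadata
    dd if=/dev/zero of="$disk" bs=1M count=100 oflag=direct,dsync
  done
  ```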
- Reboot the node:

  ```sh
  sudo reboot
  ```
- Re-add the node to the Local Storage Operator:
  - Revert the PR or update the `LocalVolumeDiscovery` and `LocalVolumeSet` CRs to add the node back (see the sketch below)
  - This will cause discovery pods to start running on the node again
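  For example, mirroring the earlier removal sketch (the CR name and selector layout are again assumptions):

  ```sh
  # Append the node's hostname back onto the LocalVolumeSet node selector
  oc -n openshift-local-storage patch localvolumeset local-block --type json \
    -p '[{"op":"add","path":"/spec/nodeSelector/nodeSelectorTerms/0/matchExpressions/0/values/-","value":"<node-name>"}]'
  ```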
- Delete all pods in the `openshift-storage` namespace to force recreation:

  ```sh
  oc delete pod --all -n openshift-storage
  ```
- Verify ODF health:
  - Check Ceph cluster health:

    ```sh
    oc get cephcluster -n openshift-storage
    ```

  - Confirm all pods are running and the node is back in the `Ready` state:

    ```sh
    oc get nodes
    oc get pods -n openshift-storage
    ```
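  A few optional extra checks (the `oc wait` timeout is arbitrary, and the last command assumes the Rook toolbox deployment `rook-ceph-tools` is enabled):

  ```sh
  # Wait for the node to report Ready
  oc wait --for=condition=Ready node/<node-name> --timeout=10m

  # If the Rook toolbox is deployed, query Ceph directly
  oc -n openshift-storage rsh deploy/rook-ceph-tools ceph status
  ```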
Expected outcome:

- Node successfully rejoined the cluster
- All OSD and MON pods are stabilized
- ODF cluster health returned to healthy