stakater/odf-disk-cleaner

ODF disk-cleaner

Python and shell scripts, packaged as Docker images, that clean up disks after an OpenShift Data Foundation (ODF) uninstall.

Recovering ODF node with OSD and MON pods in CrashLoopBackOff due to improper disk cleanup

An OpenShift Data Foundation (ODF) installation can fail when one or more nodes in the cluster do not reach the Ready state. The Ceph OSD and MON pods on those nodes get stuck in CrashLoopBackOff. The suspected root cause is improper or incomplete disk cleanup on the node.

Root cause

Local disks that were cleaned up incorrectly or incompletely, for example after a previous ODF uninstall, prevent the Ceph OSD and MON pods from starting and keep the node from becoming Ready.

Resolution steps

  1. Remove the node from the Local Storage Operator:

    • Edit the LocalVolumeDiscovery and LocalVolumeSet CRs to remove the affected node
  2. Cordon and drain the node (oc adm drain also cordons the node, so the explicit cordon only makes the intent visible):

    oc adm cordon <node-name>
    oc adm drain <node-name> --ignore-daemonsets --delete-emptydir-data
  3. Identify and delete the PersistentVolume objects that belong to the node. PersistentVolumes are cluster-scoped, so no namespace flag is needed, even though their claims live in the openshift-storage namespace:

    oc get pv | grep <node-name>
    oc delete pv <pv-name>
  4. Clean up the disks on the node manually by running the cleanup container:

    1. Authenticate against the OpenShift cluster
    2. Start a debug pod: oc debug node/<node>
    3. Change root into the host filesystem to access the host's binaries and files: chroot /host
    4. Run the script as a container: podman run ghcr.io/stakater/odf-disk-cleaner:vX.Y.Z --disks "/path/to/disk1 /path/to/disk2 ..."
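  5. Reboot the node: sudo reboot. The node stays cordoned from step 2, so once it is back, make it schedulable again: oc adm uncordon <node-name>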

  6. Re-add the node to the Local Storage Operator:

    • Undo the change from step 1: revert the PR that removed the node (if the CRs are managed in Git), or update the LocalVolumeDiscovery and LocalVolumeSet CRs to add the node back
    • This will cause discovery pods to start running again on the node
  7. Delete all pods in the openshift-storage namespace to force their recreation: oc delete pod --all -n openshift-storage

  8. Verify ODF health

    • Check Ceph cluster health: oc get cephcluster -n openshift-storage

    • Confirm all pods are running and the node is back in Ready state:

      oc get nodes
      oc get pods -n openshift-storage
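Steps 2 and 3 above can be sketched as a small shell helper. This is an illustration under assumptions, not part of this repository: the function names (list_pv_affinity, pvs_on_node, run, prepare_node) and the DRY_RUN variable are hypothetical. It also assumes the PersistentVolumes are local volumes whose node affinity lists the node's hostname, which makes them findable even when the node name does not show up in the default oc get pv output. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are hypothetical, not part of this repository.

# Print "<pv-name><TAB><node-affinity values>" for every PV in the cluster.
# Assumes local PVs whose node affinity names the node's hostname.
list_pv_affinity() {
  oc get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeAffinity.required.nodeSelectorTerms[*].matchExpressions[*].values[*]}{"\n"}{end}'
}

# Read list_pv_affinity output on stdin; print the PV names pinned to node $1.
pvs_on_node() {
  awk -F '\t' -v node="$1" '
    {
      n = split($2, hosts, " ")
      for (i = 1; i <= n; i++) {
        if (hosts[i] == node) { print $1; break }
      }
    }'
}

# Run a command, or only print it when DRY_RUN=1 (the default).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Cordon and drain the node, then delete the PVs that are pinned to it.
prepare_node() {
  local node="$1" pv
  run oc adm cordon "$node"
  run oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data
  list_pv_affinity | pvs_on_node "$node" | while read -r pv; do
    run oc delete pv "$pv"
  done
}
```

After sourcing the file, prepare_node <node-name> prints the planned commands; DRY_RUN=0 prepare_node <node-name> executes them. Reviewing the dry-run output first is worthwhile, because deleting the wrong PersistentVolume is hard to undo.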

Expected outcome

  • The node rejoins the cluster and reports Ready
  • All OSD and MON pods are running and no longer restart
  • The Ceph cluster reports a healthy state
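The expected outcome can be checked with a short script instead of reading the command output by eye. The sketch below makes some assumptions: the function names (unhealthy_pods, check_odf) are hypothetical, the pod check relies on STATUS being the third column of oc get pods --no-headers, and the health check assumes the CephCluster resource exposes its health at .status.ceph.health with HEALTH_OK as the healthy value. It does not catch pods that are Running but not yet fully ready.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of this repository.

# Read `oc get pods --no-headers` output on stdin and print every pod whose
# STATUS column is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Report Ceph health and any unhealthy pods in the given namespace.
# Returns 0 only if no pod is unhealthy and Ceph reports HEALTH_OK.
check_odf() {
  local ns="${1:-openshift-storage}" health bad
  # Assumption: the CephCluster status carries the health at .status.ceph.health.
  health=$(oc get cephcluster -n "$ns" -o jsonpath='{.items[0].status.ceph.health}')
  echo "Ceph health: ${health:-unknown}"
  bad=$(oc get pods -n "$ns" --no-headers | unhealthy_pods)
  if [ -n "$bad" ]; then
    echo "Pods not running:"
    echo "$bad"
    return 1
  fi
  [ "$health" = "HEALTH_OK" ]
}
```

Calling check_odf after step 8 gives a single exit code that can gate further automation. Pods may need a few minutes to settle after the mass deletion in step 7, so a failing first run is not necessarily a problem.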
