@jmclong (Contributor) commented Dec 12, 2025

This pull request makes mdadm-based RAID setup in the Helm chart more robust and modular. The RAID initialization logic moves out of inline bash into a dedicated script, and the documentation now explains how LVM RAID differs from mdadm-based RAID. The PR also updates the pre-commit hooks, makes minor script fixes, and adds a new deployment parameter file.
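The linked issue (#302) reports that the NVMe RAID array's device path is not consistent after a node reboot. One common way to avoid depending on a kernel-assigned node such as /dev/md0 is to look the array up by name under /dev/md/, where mdadm and udev typically create a symlink for named arrays (the exact link name can depend on the homehost setting). The sketch below assumes that layout; resolve_raid_device and the MD_DIR override are hypothetical names, not code from setup-raid.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve a named md array to its current kernel device.
# Assumes named arrays are exposed as symlinks under /dev/md/<name>.
# MD_DIR is overridable so the function can be exercised without real arrays.
resolve_raid_device() {
  local name="$1"
  local md_dir="${MD_DIR:-/dev/md}"
  local link="${md_dir}/${name}"

  # The symlink name is stable across reboots; its target (md0, md127, ...) may not be.
  if [[ -e "${link}" ]]; then
    readlink -f -- "${link}"
    return 0
  fi
  echo "No md array named '${name}' under ${md_dir}" >&2
  return 1
}
```

For example, `RAID_DEVICE="$(resolve_raid_device "${RAID_NAME}")"` would pick up the array whether the kernel named it md0 or md127 on this boot.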

RAID Setup Improvements:

  • Refactored RAID initialization into a new script, charts/latest/scripts/setup-raid.sh, which detects, creates, and validates the RAID array and the LVM volume group. The script handles both single-device and multi-device NVMe configurations and is idempotent, so it can run again on a node that is already set up. (charts/latest/scripts/setup-raid.sh)
  • Updated the DaemonSet Helm template to call the new RAID setup script instead of running inline bash; the template now passes VOLUME_GROUP to the script as an environment variable for flexibility. (charts/latest/templates/daemonset.yaml) [1] [2]
  • Enhanced RAID configuration documentation in charts/latest/README.md, charts/latest/values.yaml, and docs/user-guide.md to clarify the differences between LVM RAID and mdadm-based RAID, highlight the experimental status of mdadm RAID, and warn about migration limitations. [1] [2] [3] [4]
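Because the script supports both a single NVMe device and several, the mdadm arguments differ slightly between the two cases: mdadm treats a one-device array as a probable mistake unless --force is given, which is presumably why the create call in this PR passes --force. The sketch below assembles the argument list without running mdadm; build_create_args is a hypothetical helper, and unlike the PR it only adds --force for the single-device case. It places --force before --raid-devices because mdadm's error message for this case asks for that order.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print mdadm --create arguments, one per line, for a
# RAID 0 array over the given devices. It never invokes mdadm itself.
build_create_args() {
  local raid_device="$1" raid_name="$2"
  shift 2
  local -a devices=("$@")

  if (( ${#devices[@]} == 0 )); then
    echo "Error: no devices given" >&2
    return 1
  fi

  local -a args=(--create "${raid_device}" --name="${raid_name}" --level=0)
  # A one-device array needs --force; put it ahead of the device count.
  if (( ${#devices[@]} == 1 )); then
    args+=(--force)
  fi
  args+=(--raid-devices="${#devices[@]}" "${devices[@]}" --run)
  printf '%s\n' "${args[@]}"
}
```

A caller could then run `mapfile -t args < <(build_create_args /dev/md0 "${RAID_NAME}" "${UNUSED_DEVICES[@]}")` followed by `mdadm "${args[@]}"`, which keeps the argument logic testable on machines without mdadm.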

Tooling and Automation:

  • Added a pre-commit hook for ShellCheck to .pre-commit-config.yaml and updated several tool versions for improved linting and code quality. [1] [2] [3]
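A typical ShellCheck finding is SC2086 ("Double quote to prevent globbing and word splitting"), the same class of issue addressed by the quoting fixes in the deployment scripts. The short demonstration below uses a hypothetical count_args helper to show how an unquoted expansion splits a value containing a space into two arguments, while the quoted form passes it through as one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report how many arguments it received.
count_args() {
  echo "$#"
}

# A value containing a space, e.g. a path.
value="/mnt/data dir/config"

# Unquoted: word splitting turns one value into two arguments (SC2086).
# shellcheck disable=SC2086
unquoted_count="$(count_args ${value})"

# Quoted: the value is passed as a single argument.
quoted_count="$(count_args "${value}")"

echo "unquoted=${unquoted_count} quoted=${quoted_count}"
```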

Deployment and Script Enhancements:

  • Added a new deployment parameter file for NVMe v4 Ubuntu clusters. (deploy/parameters/nvme-v4-ubuntu.json)
  • Improved quoting of variables in deployment scripts for safer shell execution. (deploy/scripts/arc-install.sh, deploy/scripts/arc-uninstall.sh) [1] [2]
  • Minor bug fixes and improvements in utility scripts, such as handling array arguments in hack/add_copyright.sh and using read -rp for better user prompts in hack/cleanup_leaked_resources.sh. [1] [2]
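The utility-script bullet mentions handling array arguments and switching to read -rp. The sketch below shows both idioms in isolation; forward_files and confirm are hypothetical names, not functions from hack/add_copyright.sh or hack/cleanup_leaked_resources.sh. "$@" keeps each argument exactly as passed, -r stops read from treating backslashes as escapes, and -p supplies the prompt (bash only displays it when input comes from a terminal).

```shell
#!/usr/bin/env bash
# Hypothetical helper: collect arguments into an array without re-splitting them.
forward_files() {
  local -a files=("$@")   # "$@" keeps "a b.sh" as one element
  printf '[%s]\n' "${files[@]}"
}

# Hypothetical helper: ask for confirmation; succeeds only on y/Y/yes/YES.
confirm() {
  local reply
  # -r keeps backslashes literal; -p prints the prompt (shown only on a terminal).
  read -rp "$1 [y/N] " reply || return 1
  case "${reply}" in
    y|Y|yes|YES) return 0 ;;
    *) return 1 ;;
  esac
}
```

A cleanup script could then guard a destructive step with `if confirm "Delete leaked resources?"; then ...; fi`, defaulting to "no" on empty input or EOF.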

Documentation:

  • Various minor formatting improvements for readability in README.md, SECURITY.md, and docs/design/webhooks.md. [1] [2] [3]

Fixes #302

@jmclong marked this pull request as ready for review December 15, 2025 16:51
@jmclong requested review from a team, croomes and landreasyan as code owners December 15, 2025 16:51
Comment on lines +89 to +99
RAID_DEVICE="/dev/md0"
if ! mdadm --create "${RAID_DEVICE}" \
    --name="${RAID_NAME}" \
    --level=0 \
    --raid-devices="${#UNUSED_DEVICES[@]}" \
    "${UNUSED_DEVICES[@]}" \
    --run \
    --force; then
    echo "Error: Failed to create RAID array."
    exit 1
fi
Collaborator

It's possible to get here if /dev/md0 already exists but doesn't match our $RAID_NAME. The create should fail, which is ok. I don't think we need to support nodes with multiple raid groups (yet).
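As the reviewer notes, if /dev/md0 already exists under a different name, mdadm --create should simply fail. If a clearer error were wanted, the script could compare the existing array's name first. The sketch below parses the Name line of mdadm --detail output, which mdadm typically prints as host:name followed by an optional "(local to host ...)" note; that output format and the function name md_name_matches are assumptions to verify against the mdadm version on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `mdadm --detail <device>` output on stdin and
# succeed if the array's name (without the "host:" prefix) equals $1.
# Assumes a line shaped like: "  Name : myhost:nvme-raid  (local to host myhost)".
md_name_matches() {
  local expected="$1" name
  name="$(awk -F' : ' '/^ *Name : / { print $2; exit }')"
  name="${name%% *}"      # drop the trailing "(local to host ...)" note
  name="${name##*:}"      # drop the "host:" prefix
  [[ -n "${name}" && "${name}" == "${expected}" ]]
}
```

Usage might look like `mdadm --detail /dev/md0 2>/dev/null | md_name_matches "${RAID_NAME}"`, reporting a mismatch explicitly instead of relying on the create call to fail.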

@jmclong merged commit 15ff1f1 into main Dec 18, 2025
26 checks passed

Development
Successfully merging this pull request may close these issues:
  • NVME RAID array device path is not consistent after node reboot

3 participants