CASMTRIAGE-6370 update csm.storage.smartmon to use specific ceph image when redeploying node-exporter #220

Merged: 3 commits, Dec 6, 2023

leliasen-hpe (Contributor) commented Dec 4, 2023

Summary and Scope

Problem
During a storage node upgrade on Starlord, cephadm used the wrong image when trying to redeploy node-exporter. The likely cause is that, midway through a storage node upgrade, the cephadm version installed on the node is newer than the Ceph container version that is running. cephadm then looks for a default image at 'docker://quay.io/ceph/ceph:v17', which it cannot find, and the Ansible play fails.

Solution/Change
We now specify the container image that cephadm should use when redeploying node-exporter, so it should no longer look for this default image.
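
As a rough illustration of the idea (not the actual role code), the sketch below derives an image tag from `ceph --version`-style output and builds a `cephadm --image ... shell -- ceph orch redeploy node-exporter` command. The registry prefix, the function names, and the exact way the play passes the image are assumptions; only the general approach of pinning the image to the running Ceph version comes from this PR.

```python
import re
from typing import List

# Hypothetical registry path; the real image location is not stated in this PR.
REGISTRY_PREFIX = "registry.local/quay.io/ceph/ceph"


def parse_ceph_version(version_output: str) -> str:
    """Extract 'X.Y.Z' from output shaped like `ceph --version`."""
    match = re.search(r"ceph version (\d+\.\d+\.\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version output: {version_output!r}")
    return match.group(1)


def redeploy_command(version_output: str, service: str = "node-exporter") -> List[str]:
    """Build a cephadm command that pins the image to the running Ceph version."""
    version = parse_ceph_version(version_output)
    image = f"{REGISTRY_PREFIX}:v{version}"
    # Passing --image explicitly means cephadm does not fall back to its
    # built-in default image.
    return ["cephadm", "--image", image, "shell", "--",
            "ceph", "orch", "redeploy", service]


if __name__ == "__main__":
    sample = "ceph version 16.2.13 (0123abcd) pacific (stable)"  # illustrative input
    print(" ".join(redeploy_command(sample)))
```

Deriving the tag from the running version, rather than hard-coding it, keeps the redeploy aligned with whatever Ceph containers are currently running on the node mid-upgrade.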

Notes
We were not able to prove why cephadm was looking for the Ceph image at 'docker://quay.io/ceph/ceph:v17'. I have not reproduced the problem or verified that this change resolves it. However, this is our best guess as to what is causing the problem.

Issues and Related PRs

List and characterize relationship to Jira/Github issues and other pull requests. Be sure to list dependencies.

Testing

Tested on:

  • Beau, CSM 1.6 vShasta2

Test description:

I ran this Ansible play on a storage node that had the Ceph admin keyring, so the play ran all the way through. I also ran the play on storage nodes that did not have the Ceph admin keyring, so the play was skipped. Both cases behaved as expected. CFS logs for both cases are below.

Output when storage node had ceph admin keyring

TASK [csm.storage.smartmon : Get Ceph version] *********************************
changed: [x3000c0s30b0n0]

TASK [csm.storage.smartmon : Redeploy node-exporter] ***************************
changed: [x3000c0s30b0n0]

TASK [csm.storage.smartmon : Reconfig node-exporter] ***************************
changed: [x3000c0s30b0n0] => (item=reconfig node-exporter)
changed: [x3000c0s30b0n0] => (item=redeploy node-exporter)

PLAY RECAP *********************************************************************
x3000c0s30b0n0             : ok=23   changed=8    unreachable=0    failed=0    skipped=5    rescued=3    ignored=0

All playbooks completed successfully

Output when storage node did not have ceph admin keyring

PLAY [Management_Storage:!cfs_image] *******************************************

PLAY RECAP *********************************************************************
x3000c0s31b0n0             : ok=20   changed=5    unreachable=0    failed=0    skipped=8    rescued=3    ignored=0

All playbooks completed successfully

Risks and Mitigations

I have not had a chance to test this on metal to confirm that it solves the problem. However, it will be tested during our internal CSM 1.4 to 1.5 upgrades.

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@leliasen-hpe leliasen-hpe requested a review from a team as a code owner December 4, 2023 19:43
@Cray-HPE Cray-HPE deleted a comment from github-actions bot Dec 5, 2023
@Cray-HPE Cray-HPE deleted a comment from github-actions bot Dec 5, 2023
leliasen-hpe (Contributor, Author) commented:

/backport release/1.16

github-actions bot commented Dec 5, 2023

Backporting into branch release/1.16 was successful. New PR: #221

Review thread on CHANGELOG.md (outdated, resolved)
Update version link list
@mharding-hpe mharding-hpe merged commit bfb062e into master Dec 6, 2023
8 checks passed
@mharding-hpe mharding-hpe deleted the CASMTRIAGE-6370 branch December 6, 2023 17:37
mharding-hpe added a commit that referenced this pull request Dec 6, 2023
…0-1701884250

[chore] master -> develop from PR #220 (CASMTRIAGE-6370)