CASMTRIAGE-6370 update csm.storage.smartmon to use specific ceph image when redeploying node-exporter #220
Summary and Scope
Problem
During a storage node upgrade on Starlord, cephadm used the incorrect image when redeploying node-exporter. This is likely because, midway through a storage node upgrade, the cephadm version installed on the node is newer than the Ceph container version that is running. cephadm therefore looked for its default image at 'docker://quay.io/ceph/ceph:v17', which it was unable to find, causing the Ansible play to fail.
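For context, one way to observe the version skew described above is to compare the cephadm package installed on the node with the image the running daemons report. This is a hypothetical diagnostic sketch, not part of the play:

```yaml
# Hypothetical diagnostic, not part of the role: compare the cephadm package
# installed on the node with the image the running daemons were deployed from.
- name: Show the cephadm package version installed on the node
  ansible.builtin.command: rpm -q cephadm
  register: cephadm_pkg
  changed_when: false

- name: Show the container image the running Ceph daemons were deployed from
  # cephadm ls emits JSON; container_image_name is one of its fields.
  ansible.builtin.shell: cephadm ls | jq -r '.[0].container_image_name'
  register: running_image
  changed_when: false
```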
Solution/Change
We can explicitly specify the container image that cephadm should use, so it never falls back to looking up this default image.
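A minimal sketch of the kind of change described, assuming the play shells out to cephadm; the task names, the jq selection, and the daemon name are illustrative rather than the actual role contents:

```yaml
# Sketch only: pin cephadm to the image the cluster is already running so it
# never falls back to its version-derived default (quay.io/ceph/ceph:v17).
- name: Determine the Ceph image the running daemons were deployed from
  # Assumes at least one Ceph daemon is deployed on the host.
  ansible.builtin.shell: cephadm ls | jq -r '.[0].container_image_name'
  register: ceph_image
  changed_when: false

- name: Redeploy node-exporter using the pinned image
  # --image is a global cephadm flag; without it, cephadm shell resolves a
  # default image from its own version, which is the lookup that failed.
  ansible.builtin.command: >
    cephadm --image {{ ceph_image.stdout }} shell --
    ceph orch daemon redeploy node-exporter.{{ ansible_hostname }}
```

Because the pinned image is the one the cluster is already running, it is present locally and no registry lookup is needed.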
Notes
We were not able to prove why cephadm was looking for the Ceph image at 'docker://quay.io/ceph/ceph:v17'. I have not replicated the problem to confirm that this change is indeed the solution; however, this is our best guess at the cause.
Issues and Related PRs
Related Jira issue: CASMTRIAGE-6370.
Testing
Tested on:
Test description:
I ran this Ansible play on a storage node that contained the Ceph admin keyring, so the play ran all the way through. I also ran the play on storage nodes that did not have the Ceph admin keyring, so the play was skipped. Both tests behaved as expected. CFS logs are below for both cases; a sketch of the keyring guard these tests exercise follows the log excerpts.
Output when storage node had ceph admin keyring
Output when storage node did not have ceph admin keyring
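For reference, the skip behaviour exercised above would typically hinge on a guard like the following, reusing the ceph_image fact from the sketch earlier. The keyring path is the conventional location and the task names are illustrative, not the actual role contents:

```yaml
# Illustrative guard, assuming the conventional admin keyring location: run the
# redeploy only on nodes that hold the Ceph admin keyring, skip everywhere else.
- name: Check for the Ceph admin keyring
  ansible.builtin.stat:
    path: /etc/ceph/ceph.client.admin.keyring
  register: admin_keyring

- name: Redeploy node-exporter with the pinned image
  ansible.builtin.command: >
    cephadm --image {{ ceph_image.stdout }} shell --
    ceph orch daemon redeploy node-exporter.{{ ansible_hostname }}
  when: admin_keyring.stat.exists
```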
Risks and Mitigations
I did not get a chance to test this on metal to confirm it solves the problem; however, it will be tested during our internal CSM 1.4 to 1.5 upgrades.
Pull Request Checklist