end-to-end gpu driver testing enhancement #88

Merged: 1 commit, Aug 20, 2024

shivakunv
Contributor

TODO:

1
matrix:
  driver:
    - 535.183.06
    - 550.90.07
An idea for a potential follow-up: instead of defining a matrix and spinning up one AWS instance per driver version, could we pass all driver versions as input to the test script and test them in sequence? For example, first install 535.183.06, and then upgrade to 550.90.07. This would also let us source the driver versions from the versions.mk file, instead of having to redefine and maintain that list here.
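A minimal sketch of this follow-up idea, assuming versions.mk contains a single line of the form DRIVER_VERSIONS ?= 535.183.06 550.90.07 (the actual layout of that file is not shown in this PR). The names read_driver_versions, test_driver_version, and run_all_driver_versions are hypothetical, and test_driver_version is a placeholder that only echoes where a real script would install or upgrade the driver and run the checks.

```shell
# Print the driver versions listed in a makefile, one per line.
# ASSUMPTION: the file has a line like "DRIVER_VERSIONS ?= 535.183.06 550.90.07".
read_driver_versions() {
  awk -F'=' '/^DRIVER_VERSIONS[ \t]*[?:]?=/ {
    n = split($2, v, " ")
    for (i = 1; i <= n; i++) print v[i]
  }' "$1"
}

# Hypothetical per-version hook. A real script would install (first version)
# or upgrade (later versions) the driver here and then run the checks.
test_driver_version() {
  echo "testing driver $1"
}

# Visit every version in the order it is listed, stopping at the first failure.
run_all_driver_versions() {
  for version in $(read_driver_versions "$1"); do
    test_driver_version "${version}" || return 1
  done
}
```

Calling run_all_driver_versions versions.mk would then walk the versions in file order on a single instance, so the list would no longer need to be repeated in the workflow matrix.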

2
+name: CI
Let's rename this to End-to-end tests

3
+# Install the operator with usePrecompiled mode set to true
Remove this comment as it is not accurate.

4
AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}
I don't believe AWS_SESSION_TOKEN is needed with the current holodeck implementation. Let's remove it.

5
We can simply wait for the nvidia-driver pod to be ready:
kubectl wait -n ${TEST_NAMESPACE} --for=condition=Ready pod -l app=nvidia-driver-daemonset --timeout 10m
If successful, then wait for the validator pod to be ready (this means that the rest of the pods are healthy):
kubectl wait -n ${TEST_NAMESPACE} --for=condition=Ready pod -l app=nvidia-operator-validator --timeout 2m
If either of these commands fails, capture the state of all pods in the operator namespace by running kubectl get pods -n ${TEST_NAMESPACE}, and also capture some logs so we can debug.
This will reduce the amount of logs emitted during the test. Right now, we print out all pods every 5 seconds, which makes the output very hard to read.
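The flow described in item 5 could be sketched roughly as follows. The kubectl wait and kubectl get pods commands are the ones quoted above; the kubectl logs call, its --tail=100 limit, and the function names wait_for_pods_ready and verify_operator are assumptions for illustration, not necessarily what the merged scripts do.

```shell
# Wait until pods matching a label are Ready. On failure, print the pod list
# and recent logs to stderr so the run can be debugged, then return non-zero.
# Arguments: namespace, label selector, timeout (for example 10m).
wait_for_pods_ready() {
  ns="$1"
  label="$2"
  timeout="$3"
  if kubectl wait -n "${ns}" --for=condition=Ready pod -l "${label}" --timeout "${timeout}"; then
    return 0
  fi
  echo "pods with label ${label} were not Ready within ${timeout}" >&2
  kubectl get pods -n "${ns}" >&2 || true
  # ASSUMPTION: the last 100 log lines per container are enough to debug.
  kubectl logs -n "${ns}" -l "${label}" --all-containers --tail=100 >&2 || true
  return 1
}

# Driver pod first, then the validator; the second wait is skipped if the
# first one fails.
verify_operator() {
  wait_for_pods_ready "${TEST_NAMESPACE}" app=nvidia-driver-daemonset 10m &&
    wait_for_pods_ready "${TEST_NAMESPACE}" app=nvidia-operator-validator 2m
}
```

With this shape the test prints nothing beyond the kubectl wait output on success, and the pod list appears once, only when a wait fails.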

@shivakunv self-assigned this on Aug 16, 2024
@shivakunv force-pushed the enhancegpuvalidation branch from bf4b25a to b29b09c on August 16, 2024 10:24
@shivakunv requested a review from cdesiniotis on August 16, 2024 10:24
.github/workflows/ci.yaml (outdated, resolved)
.github/workflows/ci.yaml (outdated, resolved)
@shivakunv force-pushed the enhancegpuvalidation branch from b29b09c to b042889 on August 16, 2024 10:31
@shivakunv marked this pull request as ready for review on August 16, 2024 10:32
@shivakunv force-pushed the enhancegpuvalidation branch 16 times, most recently from ae9de5c to e6d824a, on August 16, 2024 20:16
tests/scripts/checks.sh (outdated, resolved)
tests/scripts/remote.sh (resolved)
tests/scripts/verify-operator.sh (outdated, resolved)
tests/scripts/verify-operator.sh (outdated, resolved)
@shivakunv force-pushed the enhancegpuvalidation branch 4 times, most recently from 07a660f to 84f80c9, on August 17, 2024 05:29
@shivakunv force-pushed the enhancegpuvalidation branch from 84f80c9 to 13839ac on August 17, 2024 06:11
@shivakunv
Contributor Author

@cdesiniotis PTAL

.github/workflows/ci.yaml (outdated, resolved)
tests/scripts/must-gather.sh (outdated, resolved)
tests/scripts/uninstall-operator.sh (outdated, resolved)
@shivakunv force-pushed the enhancegpuvalidation branch 9 times, most recently from d97cc36 to c26329e, on August 20, 2024 09:21
@shivakunv force-pushed the enhancegpuvalidation branch from c26329e to c6f8865 on August 20, 2024 09:40
@cdesiniotis merged commit f8c3a2b into NVIDIA:main on Aug 20, 2024
6 checks passed
@shivakunv deleted the enhancegpuvalidation branch on September 4, 2024 17:32
2 participants