end-to-end gpu driver testing enhancement #88
Merged
shivakunv force-pushed the enhancegpuvalidation branch from bf4b25a to b29b09c (August 16, 2024 10:24)
shivakunv commented Aug 16, 2024
shivakunv commented Aug 16, 2024
shivakunv commented Aug 16, 2024
shivakunv commented Aug 16, 2024
shivakunv force-pushed the enhancegpuvalidation branch from b29b09c to b042889 (August 16, 2024 10:31)
shivakunv force-pushed the enhancegpuvalidation branch 16 times, most recently from ae9de5c to e6d824a (August 16, 2024 20:16)
cdesiniotis reviewed Aug 16, 2024
shivakunv force-pushed the enhancegpuvalidation branch 4 times, most recently from 07a660f to 84f80c9 (August 17, 2024 05:29)
shivakunv force-pushed the enhancegpuvalidation branch from 84f80c9 to 13839ac (August 17, 2024 06:11)
@cdesiniotis PTAL
cdesiniotis reviewed Aug 19, 2024
shivakunv force-pushed the enhancegpuvalidation branch 9 times, most recently from d97cc36 to c26329e (August 20, 2024 09:21)
Signed-off-by: shiva kumar <[email protected]>
shivakunv force-pushed the enhancegpuvalidation branch from c26329e to c6f8865 (August 20, 2024 09:40)
cdesiniotis approved these changes Aug 20, 2024
1
An idea for a potential follow-up – instead of defining a matrix and spinning up one AWS instance per driver version, can we instead pass all driver versions as input to the test script and test all of them in sequence? For example, first install 535.183.06, and then upgrade to 550.90.07? This would also allow us to source the driver versions from the versions.mk file, instead of having to redefine and maintain that list here.
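A rough sketch of what this follow-up could look like, under loud assumptions: the exact layout of versions.mk is not shown in this thread, so the sketch assumes a line of the form `DRIVER_VERSIONS ?= 535.183.06 550.90.07`, and the names `read_driver_versions` and `run_all_versions` are hypothetical. The command that actually installs or upgrades to a given version and validates it is passed in by the caller.

```shell
# Sketch only. Assumes versions.mk contains a line such as:
#   DRIVER_VERSIONS ?= 535.183.06 550.90.07
# The variable name and layout are assumptions, not taken from the repository.

read_driver_versions() {
  # $1: path to versions.mk
  # Prints the space-separated list that follows the assignment operator.
  sed -n 's/^DRIVER_VERSIONS[[:space:]]*[?:]*=[[:space:]]*//p' "$1"
}

run_all_versions() {
  # $1: path to versions.mk
  # $2: command that installs (or upgrades to) one driver version and validates it
  for version in $(read_driver_versions "$1"); do
    echo "Testing driver version ${version}"
    # Stop at the first version that fails so the failure is easy to locate.
    "$2" "${version}" || return 1
  done
}
```

With this shape a single instance would test the versions in the order they are listed, so an install of 535.183.06 followed by an upgrade to 550.90.07 falls out of the ordering in the file.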
2
+name: CI
Let's rename this to End-to-end tests
3
+# Install the operator with usePrecompiled mode set to true
Remove this comment as it is not accurate.
4
I don't believe AWS_SESSION_TOKEN is needed with the current holodeck implementation. Let's remove it.
5
We can simply wait for the nvidia-driver pod to be ready:
kubectl wait -n ${TEST_NAMESPACE} --for=condition=Ready pod -l app=nvidia-driver-daemonset --timeout 10m
If successful, then wait for the validator pod to be ready (this means that the rest of the pods are healthy):
kubectl wait -n ${TEST_NAMESPACE} --for=condition=Ready pod -l app=nvidia-operator-validator --timeout 2m
If either of these commands fails, capture the state of all pods in the operator namespace by running kubectl get pods -n ${TEST_NAMESPACE}, and also capture some logs so we can debug.
This will reduce the amount of logs emitted during the test. Right now, we print out all pods every 5 seconds, which makes the output very hard to read.
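Put together, the flow suggested above might look roughly like the following. This is a sketch, not the script that was merged: `KUBECTL`, `dump_debug_state`, and `wait_for_operator` are hypothetical names, and the choice to collect the last 100 log lines from the failing pods is an assumption about what "some logs" should mean.

```shell
# Sketch only. KUBECTL, dump_debug_state and wait_for_operator are hypothetical
# names; TEST_NAMESPACE is the variable used in the review comment and must be set.
KUBECTL="${KUBECTL:-kubectl}"

dump_debug_state() {
  # $1: label selector of the pods that failed to become Ready.
  # Runs once, on failure only, instead of printing all pods every few seconds.
  "$KUBECTL" get pods -n "${TEST_NAMESPACE}" || true
  "$KUBECTL" logs -n "${TEST_NAMESPACE}" -l "$1" --all-containers --tail=100 || true
}

wait_for_operator() {
  # Step 1: wait for the driver daemonset pod.
  if ! "$KUBECTL" wait -n "${TEST_NAMESPACE}" --for=condition=Ready pod \
      -l app=nvidia-driver-daemonset --timeout 10m; then
    dump_debug_state app=nvidia-driver-daemonset
    return 1
  fi
  # Step 2: the validator pod being Ready implies the remaining pods are healthy.
  if ! "$KUBECTL" wait -n "${TEST_NAMESPACE}" --for=condition=Ready pod \
      -l app=nvidia-operator-validator --timeout 2m; then
    dump_debug_state app=nvidia-operator-validator
    return 1
  fi
}
```

A test script could then call `wait_for_operator || exit 1` after installing the operator, so the job prints nothing beyond the kubectl wait output unless something goes wrong.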