-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI refactoring to cover more test support #302
Conversation
/ok to test |
1 similar comment
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
/ok to test |
…orkflow Making them reusable workflows are not possible because they would not be callable as a single job step (which was what composite actions are for). But the steps in these actions are so tiny and problem-specific that making them standalone is hard to maintain anyway, so a single, moderate-sized workflow is acceptable.
/ok to test |
1 similar comment
/ok to test |
It is weird that |
/ok to test |
/ok to test |
@vzhurba01 This is ready for review. Happy to walk you through it offline. This PR should make our CI more robust and extensible. |
pytestmark = pytest.mark.skipif( | ||
not check_nvjitlink_usable(), reason="nvJitLink not usable, maybe not installed or too old (<12.3)" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @ksimpson-work the expanded CI caught this issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wonderful, good to see it doing its job
kernel = """extern "C" __global__ void ABC() { }""" | ||
object_code = Program(kernel, "c++").compile("ptx", options=("-rdc=true",)) | ||
assert object_code._handle is None | ||
kernel = object_code.get_kernel("A") | ||
kernel = object_code.get_kernel("ABC") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dunno why it passed when the CTK (NVRTC) version is higher than the driver version, but if we want to fetch a kernel by name, it has to have a C linkage (extern "C"
) to avoid name mangling.
@@ -112,16 +110,16 @@ runs: | |||
populate_cuda_path cuda_cudart | |||
populate_cuda_path cuda_nvrtc | |||
populate_cuda_path cuda_profiler_api | |||
populate_cuda_path libnvjitlink | |||
populate_cuda_path cuda_cccl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed by #261. Turns out that the cooperative group headers depend on nv/target
, the latter of which is part of CCCL but the former is not... 🙁 (cc @jrhemstad)
if [[ "$(cut -d '.' -f 1 <<< ${{ inputs.cuda-version }})" -ge 12 ]]; then | ||
populate_cuda_path libnvjitlink | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for CUDA 11 pipelines
- name: Set up CTK cache variable | ||
shell: bash --noprofile --norc -xeuo pipefail {0} | ||
run: | | ||
echo "CTK_CACHE_KEY=mini-ctk-${{ inputs.cuda-version }}-${{ inputs.host-platform }}" >> $GITHUB_ENV | ||
echo "CTK_CACHE_FILENAME=mini-ctk-${{ inputs.cuda-version }}-${{ inputs.host-platform }}.tar.gz" >> $GITHUB_ENV | ||
|
||
- name: Install dependencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This step is really messy, and it is so because we could run this action in different environments (GH- or self- hosted vm images, or an arbitrary container). It'd be really nice if we could unify the environments...
# TODO: enable testing once win-64 GPU runners are up | ||
# - win-64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: we comment it out instead of using an if:
check to skip, because matrix
is evaluated after if:
... Turns out this is documented in GHA docs :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref: actions/runner#1985
cuda-version: | ||
# Note: this is for test-time only. | ||
- "12.6.2" | ||
- "12.0.1" | ||
# FIXME: cannot run cuda.bindings 12.x with CTK 11.x | ||
- "11.8.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: These are test-time (run-time) CTK versions, not build-time (for which we inherit from the previous job output)!
runner: | ||
- default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: we need this extra dimension in the matrix so that we can add arbitrary runners using include:
below
# The build stage could fail but we want the CI to keep moving. | ||
if: ${{ (github.repository_owner == 'nvidia') && always() }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously ("CI 1.0"), the matrix is defined for the entire build + test workflow, so as soon as the build of a single matrix element finishes, it proceeds to the test stage immediately, without waiting for / synchronizing with the builds of other matrix elements.
With "CI 2.0", there are two separate matrices for two jobs. Therefore, the build of all matrix elements synchronizes, and then it proceeds to the tests. I've spent quite some time but unfortunately did not see any built-in approaches allowing me to easily skip this synchronization. This is the second best thing we can do, which is to keep the CI moving and fail at the test stage (when fetching missing artifacts that the previous stage failed to build). Without always()
here, as long as there's one build failure, all test pipelines will not be triggered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Therefore, the build of all matrix elements synchronizes, and then it proceeds to the tests. I've spent quite some time but unfortunately did not see any built-in approaches allowing me to easily skip this synchronization.
To avoid the (build) job-wide synchronization we need something like "a specific matrix element in the 2nd job depends only on another matrix element in the 1st job," but it's not something expressible as of today: https://github.com/orgs/community/discussions/11072
/ok to test |
/ok to test |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because i like all the changes you've made and see it as a big improvement. My only question / comment was about the removal of ci-gh.yml and you addressed that in yesterday's meeting. Thanks for all the work on the CI!
/ok to test |
Thanks, Keenan! CI is green, let's merge. I'll look into nuking ci-gh later. |
Close #278. Close #279. Close #281. Part of #303.
This is a major refactoring of our CI system ("CI 2.0"). Below is a summary of what's done; see also the commit history (I tried to make the commits self-contained for this PR).
cuda.bindings
: the binding version must match the CTK major versioncuda.core
: no constraintcuda.core
tests (Addcluster
toLaunchConfig
to support thread block clusters on Hopper #261)--privileged flag
when launching a container on self-hosted runners