Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI refactoring to cover more test support #302

Merged
merged 13 commits into from
Dec 18, 2024
Merged

Conversation

leofang
Copy link
Member

@leofang leofang commented Dec 14, 2024

Close #278. Close #279. Close #281. Part of #303.

This is a major refactoring of our CI system ("CI 2.0"). Below is a summary of what's done; see also the commit history (I tried to make the commits self-contained for this PR).

  • Merge setup/build/test actions into a single workflow with 2 jobs, each of which has its own matrix
    • This makes the CI logs for each job steps easier to browse (each has its own collapsible section)
  • Expand the test matrix to test against multiple CTK versions
    • For cuda.bindings: the binding version must match the CTK major version
    • For cuda.core: no constraint
  • Make the mini CTK fetch logic a standalone action reusable in both build/test stages
    • It can create a new mini CTK (and cache it), or fetch from cache if already existing
  • Make CI stage names more readable in the PR status summary
  • Add a H100 runner to cover the cuda.core tests (Add cluster to LaunchConfig to support thread block clusters on Hopper #261)
  • Fix test suite issues caught by the CI
  • Remove the --privileged flag when launching a container on self-hosted runners

Copy link
Contributor

copy-pr-bot bot commented Dec 14, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang
Copy link
Member Author

leofang commented Dec 14, 2024

/ok to test

1 similar comment
@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang leofang added the CI/CD CI/CD infrastructure label Dec 15, 2024
@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

…orkflow

Making them reusable workflows are not possible because they would
not be callable as a single job step (which was what composite actions
are for). But the steps in these actions are so tiny and problem-specific
that making them standalone is hard to maintain anyway, so a single,
moderate-sized workflow is acceptable.
@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

1 similar comment
@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

It is weird that runs-on can access the job matrix but if cannot... actions/runner#1985 ☹️

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 15, 2024

/ok to test

@leofang leofang requested a review from vzhurba01 December 17, 2024 00:16
@leofang leofang changed the title WIP: CI rework CI refactoring to cover more test support Dec 17, 2024
@leofang leofang marked this pull request as ready for review December 17, 2024 00:19
@leofang
Copy link
Member Author

leofang commented Dec 17, 2024

@vzhurba01 This is ready for review. Happy to walk you through it offline. This PR should make our CI more robust and extensible.

Comment on lines +63 to +65
pytestmark = pytest.mark.skipif(
not check_nvjitlink_usable(), reason="nvJitLink not usable, maybe not installed or too old (<12.3)"
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @ksimpson-work the expanded CI caught this issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonderful, good to see it doing its job

Comment on lines +18 to +21
kernel = """extern "C" __global__ void ABC() { }"""
object_code = Program(kernel, "c++").compile("ptx", options=("-rdc=true",))
assert object_code._handle is None
kernel = object_code.get_kernel("A")
kernel = object_code.get_kernel("ABC")
Copy link
Member Author

@leofang leofang Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dunno why it passed when the CTK (NVRTC) version is higher than the driver version, but if we want to fetch a kernel by name, it has to have a C linkage (extern "C") to avoid name mangling.

@@ -112,16 +110,16 @@ runs:
populate_cuda_path cuda_cudart
populate_cuda_path cuda_nvrtc
populate_cuda_path cuda_profiler_api
populate_cuda_path libnvjitlink
populate_cuda_path cuda_cccl
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed by #261. Turns out that the cooperative group headers depend on nv/target, the latter of which is part of CCCL but the former is not... 🙁 (cc @jrhemstad)

Comment on lines +114 to +116
if [[ "$(cut -d '.' -f 1 <<< ${{ inputs.cuda-version }})" -ge 12 ]]; then
populate_cuda_path libnvjitlink
fi
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for CUDA 11 pipelines

- name: Set up CTK cache variable
shell: bash --noprofile --norc -xeuo pipefail {0}
run: |
echo "CTK_CACHE_KEY=mini-ctk-${{ inputs.cuda-version }}-${{ inputs.host-platform }}" >> $GITHUB_ENV
echo "CTK_CACHE_FILENAME=mini-ctk-${{ inputs.cuda-version }}-${{ inputs.host-platform }}.tar.gz" >> $GITHUB_ENV

- name: Install dependencies
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step is really messy, and it is so because we could run this action in different environments (GH- or self- hosted vm images, or an arbitrary container). It'd be really nice if we could unify the environments...

Comment on lines +173 to +174
# TODO: enable testing once win-64 GPU runners are up
# - win-64
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we comment it out instead of using an if: check to skip, because matrix is evaluated after if:... Turns out this is documented in GHA docs :(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 181 to 186
cuda-version:
# Note: this is for test-time only.
- "12.6.2"
- "12.0.1"
# FIXME: cannot run cuda.bindings 12.x with CTK 11.x
- "11.8.0"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: These are test-time (run-time) CTK versions, not build-time (for which we inherit from the previous job output)!

Comment on lines +187 to +188
runner:
- default
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: we need this extra dimension in the matrix so that we can add arbitrary runners using include: below

Comment on lines +195 to +196
# The build stage could fail but we want the CI to keep moving.
if: ${{ (github.repository_owner == 'nvidia') && always() }}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously ("CI 1.0"), the matrix is defined for the entire build + test workflow, so as soon as the build of a single matrix element finishes, it proceeds to the test stage immediately, without waiting for / synchronizing with the builds of other matrix elements.

With "CI 2.0", there are two separate matrices for two jobs. Therefore, the build of all matrix elements synchronizes, and then it proceeds to the tests. I've spent quite some time but unfortunately did not see any built-in approaches allowing me to easily skip this synchronization. This is the second best thing we can do, which is to keep the CI moving and fail at the test stage (when fetching missing artifacts that the previous stage failed to build). Without always() here, as long as there's one build failure, all test pipelines will not be triggered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore, the build of all matrix elements synchronizes, and then it proceeds to the tests. I've spent quite some time but unfortunately did not see any built-in approaches allowing me to easily skip this synchronization.

To avoid the (build) job-wide synchronization we need something like "a specific matrix element in the 2nd job depends only on another matrix element in the 1st job," but it's not something expressible as of today: https://github.com/orgs/community/discussions/11072

@leofang
Copy link
Member Author

leofang commented Dec 17, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 17, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 17, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 17, 2024

/ok to test

@ksimpson-work ksimpson-work self-requested a review December 18, 2024 14:35
Copy link
Contributor

@ksimpson-work ksimpson-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving because i like all the changes you've made and see it as a big improvement. My only question / comment was about the removal of ci-gh.yml and you addressed that in yesterday's meeting. Thanks for all the work on the CI!

@leofang
Copy link
Member Author

leofang commented Dec 18, 2024

/ok to test

@leofang
Copy link
Member Author

leofang commented Dec 18, 2024

Thanks, Keenan! CI is green, let's merge. I'll look into nuking ci-gh later.

@leofang leofang merged commit 3ac17fe into NVIDIA:main Dec 18, 2024
46 checks passed
@leofang leofang deleted the ci_refactor branch December 18, 2024 16:41
@leofang leofang added this to the cuda-python 12-next, 11-next milestone Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD CI/CD infrastructure P0 High priority - Must do!
Projects
None yet
2 participants