CI refactoring to cover more test support #302

leofang · 2024-12-14T20:24:21Z

Close #278. Close #279. Close #281. Part of #303.

This is a major refactoring of our CI system ("CI 2.0"). Below is a summary of what's done; see also the commit history (I tried to make the commits self-contained for this PR).

Merge setup/build/test actions into a single workflow with 2 jobs, each of which has its own matrix
- This makes the CI logs for each job steps easier to browse (each has its own collapsible section)
Expand the test matrix to test against multiple CTK versions
- For cuda.bindings: the binding version must match the CTK major version
- For cuda.core: no constraint
Make the mini CTK fetch logic a standalone action reusable in both build/test stages
- It can create a new mini CTK (and cache it), or fetch from cache if already existing
Make CI stage names more readable in the PR status summary
Add a H100 runner to cover the cuda.core tests (Add cluster to LaunchConfig to support thread block clusters on Hopper #261)
Fix test suite issues caught by the CI
Remove the --privileged flag when launching a container on self-hosted runners
- As requested by @ajschmidt8

copy-pr-bot · 2024-12-14T20:24:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

leofang · 2024-12-14T20:24:33Z

/ok to test

leofang · 2024-12-15T06:04:16Z

/ok to test

leofang · 2024-12-15T06:09:14Z

/ok to test

leofang · 2024-12-15T06:11:15Z

/ok to test

leofang · 2024-12-15T06:24:48Z

/ok to test

leofang · 2024-12-15T06:34:21Z

/ok to test

leofang · 2024-12-15T16:29:18Z

/ok to test

leofang · 2024-12-15T16:34:47Z

/ok to test

leofang · 2024-12-15T19:29:47Z

/ok to test

…orkflow Making them reusable workflows are not possible because they would not be callable as a single job step (which was what composite actions are for). But the steps in these actions are so tiny and problem-specific that making them standalone is hard to maintain anyway, so a single, moderate-sized workflow is acceptable.

leofang · 2024-12-15T20:48:51Z

/ok to test

leofang · 2024-12-15T21:09:13Z

/ok to test

leofang · 2024-12-15T21:16:42Z

It is weird that runs-on can access the job matrix but if cannot... actions/runner#1985 ☹️

leofang · 2024-12-15T21:29:28Z

/ok to test

leofang · 2024-12-15T21:31:50Z

/ok to test

leofang · 2024-12-17T00:20:22Z

@vzhurba01 This is ready for review. Happy to walk you through it offline. This PR should make our CI more robust and extensible.

leofang · 2024-12-17T00:20:59Z

cuda_bindings/tests/test_nvjitlink.py

+pytestmark = pytest.mark.skipif(
+    not check_nvjitlink_usable(), reason="nvJitLink not usable, maybe not installed or too old (<12.3)"
+)


FYI @ksimpson-work the expanded CI caught this issue

wonderful, good to see it doing its job

leofang · 2024-12-17T00:22:16Z

cuda_core/tests/test_module.py

+    kernel = """extern "C" __global__ void ABC() { }"""
    object_code = Program(kernel, "c++").compile("ptx", options=("-rdc=true",))
    assert object_code._handle is None
-    kernel = object_code.get_kernel("A")
+    kernel = object_code.get_kernel("ABC")


I dunno why it passed when the CTK (NVRTC) version is higher than the driver version, but if we want to fetch a kernel by name, it has to have a C linkage (extern "C") to avoid name mangling.

leofang · 2024-12-17T00:24:04Z

.github/actions/fetch_ctk/action.yml

@@ -112,16 +110,16 @@ runs:
        populate_cuda_path cuda_cudart
        populate_cuda_path cuda_nvrtc
        populate_cuda_path cuda_profiler_api
-        populate_cuda_path libnvjitlink
+        populate_cuda_path cuda_cccl


This is needed by #261. Turns out that the cooperative group headers depend on nv/target, the latter of which is part of CCCL but the former is not... 🙁 (cc @jrhemstad)

leofang · 2024-12-17T00:24:30Z

.github/actions/fetch_ctk/action.yml

+        if [[ "$(cut -d '.' -f 1 <<< ${{ inputs.cuda-version }})" -ge 12 ]]; then
+          populate_cuda_path libnvjitlink
+        fi


This is for CUDA 11 pipelines

leofang · 2024-12-17T00:27:07Z

.github/actions/fetch_ctk/action.yml

    - name: Set up CTK cache variable
      shell: bash --noprofile --norc -xeuo pipefail {0}
      run: |
        echo "CTK_CACHE_KEY=mini-ctk-${{ inputs.cuda-version }}-${{ inputs.host-platform }}" >> $GITHUB_ENV
        echo "CTK_CACHE_FILENAME=mini-ctk-${{ inputs.cuda-version }}-${{ inputs.host-platform }}.tar.gz" >> $GITHUB_ENV

+    - name: Install dependencies


This step is really messy, and it is so because we could run this action in different environments (GH- or self- hosted vm images, or an arbitrary container). It'd be really nice if we could unify the environments...

leofang · 2024-12-17T00:30:38Z