68fa036
initial commit of adding multi-gpu ci runners
gagank1 Oct 7, 2025
62f381d
fix missing multi_gpu marker in esm2
gagank1 Oct 7, 2025
250bd29
switch to a6000 runners
gagank1 Oct 9, 2025
68307b1
fixed moco multi-gpu tests
gagank1 Oct 9, 2025
6fbe6f4
bind to only localhost
gagank1 Oct 9, 2025
494c6fb
Merge origin/main and add multi_gpu markers to new tests
gagank1 Oct 11, 2025
b777145
implemented data download in evo2 gradient equivalence test. fixed bu…
gagank1 Oct 11, 2025
c3f9737
added new ciflow:multi-gpu label and fixed scheduling
gagank1 Oct 14, 2025
9812568
update pull request template and contributing.md with instructions fo…
gagank1 Oct 14, 2025
01eabb6
run multi-gpu tests nightly only for now
gagank1 Oct 14, 2025
9520a1d
don't upload multigpu test results to codecov
gagank1 Oct 14, 2025
aa64021
match structure of existing pytest runner scripts
gagank1 Oct 14, 2025
617003a
Merge remote-tracking branch 'origin/main' into gkaushik/multi-gpu-ci
gagank1 Nov 7, 2025
16dc4ff
added @pytest.mark.multi_gpu to new recipes tests
gagank1 Nov 7, 2025
9c994c7
update recipes workflow to run multigpu tests nightly if single gpu t…
gagank1 Nov 7, 2025
2fe65dc
more scheduling changes
gagank1 Nov 13, 2025
3bf090d
add missing multi_gpu mark registration
gagank1 Nov 13, 2025
5f17689
fix recipes workflow return code, add fp8 check to test_accelerate_am…
gagank1 Nov 13, 2025
f14497c
fine tune scheduling logic
gagank1 Nov 13, 2025
d2822a4
fix labels for copy pr bot
gagank1 Nov 14, 2025
91e4dcd
xfail known bug
gagank1 Nov 14, 2025
4 changes: 4 additions & 0 deletions .github/labels.yml
@@ -18,3 +18,7 @@
- name: ciflow:skip
description: Skip all CI tests for this PR
color: B60205 # Red

- name: ciflow:multi-gpu
description: (Reserved for future use) Run all multi GPU tests (unit tests, slow tests) for bionemo2
color: 12F5AE # Teal
2 changes: 1 addition & 1 deletion .github/pull_request_template.md
@@ -30,7 +30,7 @@ Configure CI behavior by applying the relevant labels. By default, only basic un
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline.
Multi-GPU tests (marked with `@pytest.mark.multi_gpu`) are not run in PR CI. They run automatically in nightly builds after single-GPU tests pass.

For more details, see [CONTRIBUTING](CONTRIBUTING.md)

121 changes: 107 additions & 14 deletions .github/workflows/unit-tests-framework.yml
@@ -19,16 +19,29 @@
# 2. pre-commit: Runs static code checks and linting
# 3. get-pr-labels: Retrieves PR labels for conditional job execution
# 4. build-bionemo-image: Builds Docker image (conditional on triggers/labels)
# 5. run-tests: Runs unit tests (when image build succeeds)
# 6. run-tests-slow: Runs slow tests (when image succeeds + ciflow:slow label OR schedule/merge_group/ciflow:all)
# 7. run-tests-notebooks: Runs notebook tests (when image succeeds + ciflow:notebooks label OR schedule/merge_group/ciflow:all)
# 8. verify-tests-status: Verifies all test jobs completed successfully
# 5. run-tests-single-gpu: Runs fast single-GPU unit tests (when image build succeeds)
# 6. run-tests-multi-gpu: Runs fast multi-GPU tests (conditional - see below)
# 7. run-tests-slow-single-gpu: Runs slow single-GPU tests (when image succeeds + ciflow:slow/ciflow:all label OR schedule/merge_group)
# 8. run-tests-slow-multi-gpu: Runs slow multi-GPU tests (conditional - see below)
# 9. run-tests-notebooks: Runs notebook tests (when image succeeds + ciflow:notebooks/ciflow:all label OR schedule/merge_group)
# 10. verify-tests-status: Verifies all test jobs completed successfully
#
# CONDITIONAL EXECUTION:
# - build-bionemo-image runs on: schedule, ciflow:all label, (no ciflow:skip + modified files), (merge_group + modified files)
# - run-tests runs when: build-bionemo-image succeeds
# - run-tests-slow runs when: build-bionemo-image succeeds AND (schedule OR merge_group OR ciflow:all OR ciflow:slow)
# - run-tests-single-gpu runs when: build-bionemo-image succeeds
# - run-tests-multi-gpu runs when: build succeeds AND ((schedule AND single-gpu passes) OR (push AND ciflow:multi-gpu label))
# - run-tests-slow-single-gpu runs when: build-bionemo-image succeeds AND (schedule OR merge_group OR ciflow:all OR ciflow:slow)
# - run-tests-slow-multi-gpu runs when: build succeeds AND ((schedule AND slow-single-gpu passes) OR (push AND ciflow:multi-gpu label))
# - run-tests-notebooks runs when: build-bionemo-image succeeds AND (schedule OR merge_group OR ciflow:all OR ciflow:notebooks)
#
# MULTI-GPU TEST EXECUTION:
# Multi-GPU tests run in these situations:
# - On schedule (nightly): if build succeeds AND corresponding single-GPU tests pass
# - On PRs (push events): if build succeeds AND labels match:
# * Fast multi-GPU: ciflow:all OR ciflow:multi-gpu
# * Slow multi-GPU: ciflow:all OR (ciflow:multi-gpu AND ciflow:slow)
# - NOT on merge_group or any other events
# Note: On push, multi-GPU tests run in parallel with single-GPU tests (no dependency)

name: "BioNeMo Framework CI"

@@ -206,7 +219,7 @@
cache-to: ${{ steps.cache.outputs.cache-to }}


run-tests:
run-tests-single-gpu:
needs:
- build-bionemo-image
- get-pr-labels
@@ -221,7 +234,7 @@
- name: Checkout repository
uses: actions/checkout@v4

- name: Run tests
- name: Run single-GPU tests
# Tests in this stage generate code coverage metrics for the repository
# Coverage data is uploaded to Codecov in subsequent stages
env:
@@ -246,7 +259,45 @@
with:
token: ${{ secrets.CODECOV_TOKEN }}

run-tests-slow:
run-tests-multi-gpu:
needs:
- build-bionemo-image
- run-tests-single-gpu
- get-pr-labels
runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
container:
image: svcbionemo023/bionemo-framework:${{ github.run_id }}
credentials:
username: ${{ vars.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}
# Run multi-GPU tests ONLY when:
# 1. On schedule: if build succeeds AND single-GPU tests pass
# 2. On push: if build succeeds AND (ciflow:all OR ciflow:multi-gpu label)
# Do NOT run on merge_group or any other events
if: |
(needs.build-bionemo-image.result == 'success') &&
(
(
github.event_name == 'schedule' &&
needs.run-tests-single-gpu.result == 'success'
) ||
(
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
)
)
steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Run multi-GPU tests
env:
BIONEMO_DATA_SOURCE: ngc
run: |
chmod +x ./ci/scripts/run_pytest_multigpu.sh
./ci/scripts/run_pytest_multigpu.sh
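
The explicit `if:` expression of `run-tests-multi-gpu` can be restated as a small Python predicate, which makes it easier to check individual cases by hand. This is only a sketch: the function and parameter names are hypothetical, and it mirrors the written expression alone, not how GitHub Actions itself handles the results of jobs listed under `needs`. Note that the label branch of the expression does not test `github.event_name`.

```python
def should_run_multi_gpu(event_name, build_result, single_gpu_result, labels):
    """Mirror the explicit `if:` expression of the run-tests-multi-gpu job.

    All names here are illustrative; `labels` is the already-parsed label list.
    """
    if build_result != "success":
        return False

    # Branch 1: nightly schedule, gated on the single-GPU job result.
    nightly = event_name == "schedule" and single_gpu_result == "success"

    # Branch 2: label-driven; the expression does not check the event name here.
    labelled = "ciflow:all" in labels or "ciflow:multi-gpu" in labels

    return nightly or labelled
```

For example, `should_run_multi_gpu("schedule", "success", "failure", [])` is `False`, while the same call with `["ciflow:multi-gpu"]` as the label list is `True`.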

run-tests-slow-single-gpu:
needs:
- build-bionemo-image
- get-pr-labels
@@ -268,18 +319,58 @@
- name: Checkout repository
uses: actions/checkout@v4

- name: Run slow tests
- name: Run slow single-GPU tests
env:
BIONEMO_DATA_SOURCE: ngc
run: |
chmod +x ./ci/scripts/pytest_runner.sh
./ci/scripts/pytest_runner.sh --no-nbval --only-slow --skip-multi-gpu --allow-no-tests

run-tests-slow-multi-gpu:
needs:
- build-bionemo-image
- run-tests-slow-single-gpu
- get-pr-labels
runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
container:
image: svcbionemo023/bionemo-framework:${{ github.run_id }}
credentials:
username: ${{ vars.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PASSWORD }}
# Run slow multi-GPU tests ONLY when:
# 1. On schedule: if build succeeds AND slow single-GPU tests pass
# 2. On push: if build succeeds AND (ciflow:all OR (ciflow:multi-gpu AND ciflow:slow))
# Do NOT run on merge_group or any other events
if: |
(needs.build-bionemo-image.result == 'success') &&
(
(
github.event_name == 'schedule' &&
needs.run-tests-slow-single-gpu.result == 'success'
) ||
(
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
(
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu') &&
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:slow')
)
)
)
steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Run slow multi-GPU tests
env:
BIONEMO_DATA_SOURCE: ngc
# Not every sub-package has slow tests, and since some sub-packages have tests under the same name we need
# to run package by package like we do with the fast tests.
run: |
chmod +x ./ci/scripts/run_pytest_slow.sh
./ci/scripts/run_pytest_slow.sh
chmod +x ./ci/scripts/run_pytest_slow_multigpu.sh
./ci/scripts/run_pytest_slow_multigpu.sh
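
The slow variant differs from the fast one in two ways: it gates on the slow single-GPU job, and `ciflow:multi-gpu` alone is not enough without `ciflow:slow`. A hedged Python restatement follows; the function name and parameters are hypothetical, and the `or "[]"` fallback imitates the `fromJSON(... || '[]')` pattern used in the expression.

```python
import json


def should_run_slow_multi_gpu(event_name, build_result, slow_single_gpu_result, labels_json):
    """Mirror the explicit `if:` expression of the run-tests-slow-multi-gpu job.

    `labels_json` stands in for the raw get-pr-labels output, which may be empty.
    """
    # An empty or missing output falls back to an empty JSON list.
    labels = json.loads(labels_json or "[]")

    if build_result != "success":
        return False

    nightly = event_name == "schedule" and slow_single_gpu_result == "success"
    labelled = "ciflow:all" in labels or (
        "ciflow:multi-gpu" in labels and "ciflow:slow" in labels
    )
    return nightly or labelled
```

Under this reading, a PR labelled only `ciflow:multi-gpu` triggers the fast multi-GPU job but not the slow one.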


run-tests-notebooks:

Check warning: Code scanning / CodeQL: Workflow does not contain permissions (Medium)
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
needs:
- build-bionemo-image
- get-pr-labels
@@ -317,8 +408,10 @@
- pre-commit
- get-pr-labels
- build-bionemo-image
- run-tests
- run-tests-slow
- run-tests-single-gpu
- run-tests-multi-gpu
- run-tests-slow-single-gpu
- run-tests-slow-multi-gpu
- run-tests-notebooks
# Add all other run-*-test jobs
runs-on: ubuntu-latest
132 changes: 126 additions & 6 deletions .github/workflows/unit-tests-recipes.yml
@@ -1,3 +1,26 @@
# BioNeMo Recipes CI Workflow
#
# This workflow runs tests for BioNeMo recipes and models on various triggers:
#
# TRIGGERS:
# - Push to pull-request branches or dependabot branches
# - Merge group events (when PRs are merged via merge queue)
# - Scheduled runs (daily at 9 AM UTC)
#
# WORKFLOW OVERVIEW:
# 1. changed-dirs: Detects which recipe/model directories have changed
# 2. get-pr-labels: Retrieves PR labels for conditional job execution
# 3. unit-tests-single-gpu: Runs single-GPU tests for changed directories
# 4. unit-tests-multi-gpu: Runs multi-GPU tests (conditional - see below)
# 5. verify-recipe-tests: Verifies all test jobs completed successfully
#
# MULTI-GPU TEST EXECUTION:
# Multi-GPU tests run in these situations:
# - On schedule (nightly): if changed dirs exist AND single-GPU tests pass
# - On PRs (push events): if changed dirs exist AND (ciflow:all-recipes OR ciflow:multi-gpu)
# - NOT on merge_group or any other events
# Note: On push, multi-GPU tests run in parallel with single-GPU tests (no dependency)

name: "BioNeMo Recipes CI"

on:
@@ -127,11 +150,11 @@
echo '${{ toJSON(steps.set-dirs.outputs) }}'
shell: bash

unit-tests:
unit-tests-single-gpu:
needs: changed-dirs
runs-on: linux-amd64-gpu-l4-latest-1
if: ${{ needs.changed-dirs.outputs.dirs != '[]' }}
name: "unit-tests (${{ matrix.recipe.name }})"
name: "unit-tests-single-gpu (${{ matrix.recipe.name }})"
container:
image: ${{ matrix.recipe.image }}
options: --shm-size=16G
@@ -169,23 +192,120 @@
exit 1
fi

- name: Run tests
- name: Run single-GPU tests
working-directory: ${{ matrix.recipe.dir }}
run: pytest -v -m "not multi_gpu" .

# With copy-pr-bot, we need to get the PR labels from the PR API rather than from the event metadata.
get-pr-labels:
runs-on: ubuntu-latest
outputs:
labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
steps:
- name: Get PR number from branch
if: startsWith(github.ref, 'refs/heads/pull-request/')
id: get-pr-num
run: |
PR_NUM=$(echo ${{ github.ref_name }} | grep -oE '[0-9]+$')
echo "pr_num=$PR_NUM" >> $GITHUB_OUTPUT

- name: Get PR labels
id: get-labels
if: startsWith(github.ref, 'refs/heads/pull-request/')
env:
GH_TOKEN: ${{ github.token }}
run: |
LABELS=$(gh api repos/${{ github.repository }}/pulls/${{ steps.get-pr-num.outputs.pr_num }} --jq '[.labels[].name]' || echo "[]")
echo "labels=$LABELS" >> $GITHUB_OUTPUT
echo "Retrieved labels: $LABELS"

- name: Set empty labels for non-PR branches
if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
id: get-labels-empty
run: |
echo "labels=[]" >> $GITHUB_OUTPUT
echo "Set empty labels for non-PR branch"
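
The `get-pr-labels` job derives the PR number from the trailing digits of a `pull-request/<number>` branch name and falls back to an empty label list elsewhere. A minimal Python sketch of that extraction step, equivalent in intent to `grep -oE '[0-9]+$'`, is shown here; the function name is hypothetical and the example ref names are made up for illustration.

```python
import re


def pr_number_from_ref_name(ref_name):
    """Return the trailing run of digits in a branch name, or None if absent."""
    match = re.search(r"[0-9]+$", ref_name)
    return match.group(0) if match else None


# Illustrative ref names only; real values come from github.ref_name.
print(pr_number_from_ref_name("pull-request/1234"))
print(pr_number_from_ref_name("main"))
```

A branch without trailing digits yields `None`, which corresponds to the case where the workflow sets `labels=[]`.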

unit-tests-multi-gpu:
Comment on lines +201 to +229

Check warning: Code scanning / CodeQL: Workflow does not contain permissions (Medium)
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}

Copilot Autofix

AI 7 days ago

The recommended fix is to explicitly specify least privilege permissions for the workflow or, ideally, for the individual jobs in .github/workflows/unit-tests-recipes.yml.

  • For the get-pr-labels job, the only required permission is to read pull request metadata (and potentially repository contents if any fetch occurs).

  • pull-requests: read and optionally contents: read cover reading PR info and repo access.

  • Add a top-level permissions block after name: (applies to all jobs), or alternatively, apply a narrower permissions block to each job. For simplicity and following the error’s suggestion, add at the top/workflow level.

  • Edit the file .github/workflows/unit-tests-recipes.yml, adding:

    permissions:
      contents: read
      pull-requests: read

    directly after the name: key and before events (on:).

Required methods/imports/definitions: None; this is a YAML structure change only.


Suggested changeset 1
.github/workflows/unit-tests-recipes.yml

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/.github/workflows/unit-tests-recipes.yml b/.github/workflows/unit-tests-recipes.yml
--- a/.github/workflows/unit-tests-recipes.yml
+++ b/.github/workflows/unit-tests-recipes.yml
@@ -22,6 +22,9 @@
 # Note: On push, multi-GPU tests run in parallel with single-GPU tests (no dependency)
 
 name: "BioNeMo Recipes CI"
+permissions:
+  contents: read
+  pull-requests: read
 
 on:
   push:
EOF
needs:
- changed-dirs
- unit-tests-single-gpu
- get-pr-labels
runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
# Run multi-GPU tests ONLY when:
# 1. On schedule: if changed dirs exist AND single-GPU tests pass
# 2. On push: if changed dirs exist AND (ciflow:all-recipes OR ciflow:multi-gpu label)
# Do NOT run on merge_group or any other events
if: |
(needs.changed-dirs.outputs.dirs != '[]') &&
(
(
github.event_name == 'schedule' &&
needs.unit-tests-single-gpu.result == 'success'
) ||
(
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all-recipes') ||
contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
)
)
name: "unit-tests-multi-gpu (${{ matrix.recipe.name }})"
container:
image: ${{ matrix.recipe.image }}
options: --shm-size=16G
strategy:
matrix:
recipe: ${{ fromJson(needs.changed-dirs.outputs.dirs) }}
fail-fast: false

steps:
- name: Show GPU info
run: nvidia-smi
- name: Setup proxy cache
uses: nv-gha-runners/setup-proxy-cache@main

- name: Checkout repository
uses: actions/checkout@v4
with:
sparse-checkout: "${{ matrix.recipe.dir }}"
sparse-checkout-cone-mode: false

- name: Install dependencies
working-directory: ${{ matrix.recipe.dir }}
run: |
if [ -f pyproject.toml ] || [ -f setup.py ]; then
PIP_CONSTRAINT= pip install -e .
echo "Installed ${{ matrix.recipe.dir }} as editable package"
elif [ -f requirements.txt ]; then
PIP_CONSTRAINT= pip install -r requirements.txt
echo "Installed ${{ matrix.recipe.dir }} from requirements.txt"
else
echo "No pyproject.toml, setup.py, or requirements.txt found in ${{ matrix.recipe.dir }}"
exit 1
fi

- name: Run multi-GPU tests
working-directory: ${{ matrix.recipe.dir }}
run: pytest -v .
run: |
# Run multi-GPU tests, but allow exit code 5 (no tests found) to pass
pytest -v -m "multi_gpu" . || [ $? -eq 5 ]
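
pytest exits with code 5 when it collects no tests, so the `|| [ $? -eq 5 ]` guard lets recipes without any `multi_gpu` tests pass. The same idea can be sketched in Python as a thin wrapper; the function name is hypothetical, and the demo uses a stand-in command rather than pytest so that it runs anywhere.

```python
import subprocess
import sys

NO_TESTS_COLLECTED = 5  # pytest's exit code when no tests were collected


def run_allowing_no_tests(command):
    """Run a command and treat exit codes 0 and 5 as success."""
    returncode = subprocess.run(command).returncode
    if returncode in (0, NO_TESTS_COLLECTED):
        return 0
    return returncode


# Stand-in for a pytest run that collects nothing: a child process exiting with 5.
demo_command = [sys.executable, "-c", "raise SystemExit(5)"]
print(run_allowing_no_tests(demo_command))  # prints 0
```

One difference from the shell one-liner: when pytest fails with another code, the shell form ends with the status of the `[` test (1), whereas this sketch passes the original code through.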

verify-recipe-tests:

Check warning: Code scanning / CodeQL: Workflow does not contain permissions (Medium)
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
# This job checks the status of the unit-tests matrix and fails if any matrix job failed or was cancelled.
# Use this job as the required check for PRs.
needs: unit-tests
needs:
- changed-dirs
- get-pr-labels
- unit-tests-single-gpu
- unit-tests-multi-gpu
runs-on: ubuntu-latest
if: always()
steps:
- name: Check unit-tests matrix status
run: |
if [[ "${{ needs.unit-tests.result }}" == "failure" || "${{ needs.unit-tests.result }}" == "cancelled" ]]; then
if [[ "${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}" == "true" ]]; then
echo "Some unit-tests matrix jobs have failed or been cancelled!"
exit 1
else
echo "All unit-tests matrix jobs have completed successfully or were skipped!"
exit 0
fi
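
The updated check fails the verification job when any job listed under `needs` ended in `failure` or `cancelled`, and passes when every job succeeded or was skipped. A small Python sketch of that rule, with a hypothetical function name and a plain list standing in for `needs.*.result`:

```python
def verify_exit_code(results):
    """Return 1 if any needed job failed or was cancelled, else 0."""
    blocking = {"failure", "cancelled"}
    return 1 if any(result in blocking for result in results) else 0
```

With this rule, `["success", "skipped", "success", "skipped"]` passes, which matches the intent that a skipped multi-GPU job on an unlabelled PR does not block the required check.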

Check warning: Code scanning / CodeQL: Workflow does not contain permissions (Medium)
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {}
6 changes: 5 additions & 1 deletion bionemo-recipes/models/amplify/pyproject.toml
@@ -27,4 +27,8 @@ package-dir = { "" = "src" }

[tool.pytest.ini_options]
addopts = ["--color=yes", "--strict-markers", "--tb=short", "-v"]
markers = ["fp8: marks tests as requiring FP8 support"]
markers = [
"fp8: marks tests as requiring FP8 support",
"slow: medium-complexity tests, like integration tests, on a single GPU",
"multi_gpu: tests that require multiple GPUs",
]
4 changes: 4 additions & 0 deletions bionemo-recipes/models/esm2/pyproject.toml
@@ -27,3 +27,7 @@ package-dir = { "" = "src" }

[tool.pytest.ini_options]
addopts = ["--color=yes", "--strict-markers", "--tb=short", "-v"]
markers = [
"slow: medium-complexity tests, like integration tests, on a single GPU",
"multi_gpu: tests that require multiple GPUs",
]
@@ -61,6 +61,7 @@ def test_single_process_attaches_correct_fp8_recipe(strategy):
pytest.fail(f"Command failed with exit code {result.returncode}")


@pytest.mark.multi_gpu
@pytest.mark.parametrize(
"strategy", ["ddp", "fsdp2", pytest.param("mfsdp", marks=pytest.mark.xfail(reason="BIONEMO-2999"))]
)
@@ -66,6 +66,7 @@ def test_ddp_vs_fsdp_single_gpu(strategy, backend):
pytest.fail(f"Command failed with exit code {result.returncode}")


@pytest.mark.multi_gpu
@requires_multi_gpu
@pytest.mark.parametrize("strategy", ["fsdp2", pytest.param("mfsdp", marks=pytest.mark.xfail(reason="BIONEMO-2726"))])
@pytest.mark.parametrize("backend", ["te", "eager"])
1 change: 1 addition & 0 deletions bionemo-recipes/recipes/codonfm_ptl_te/pyproject.toml
@@ -83,6 +83,7 @@ python_classes = ["Test*"]
python_functions = ["test_*"]
markers = [
"slow: marks tests as slow (deselect with '-m \"not slow\"')",
"multi_gpu: marks tests that require multiple GPUs (deselect with '-m \"not multi_gpu\"')",
"integration: marks tests as integration tests",
"unit: marks tests as unit tests",
]