CI tests on multi-GPU runners #1229
base: main
Conversation
Signed-off-by: Gagan Kaushik <[email protected]>
Walkthrough
Adds a …
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Dev as Developer
participant GH as GitHub Actions
participant Jobs as unit-tests-* jobs
participant Verify as verify-tests-status
Dev->>GH: Push PR or schedule
GH->>Jobs: Start workflow (matrix)
alt PR CI
Jobs->>Jobs: Run single-GPU jobs (multi-GPU jobs skipped)
else Nightly / scheduled
Jobs->>Jobs: Run single-GPU jobs
Jobs->>Jobs: Run multi-GPU jobs after single-GPU success
end
Jobs-->>GH: Upload artifacts/coverage
Jobs->>Verify: Report job results
Verify-->>GH: Aggregate status
sequenceDiagram
autonumber
participant Job as CI job
participant Runner as pytest_runner.sh
participant PyTest as pytest
Job->>Runner: Invoke with flags (--only-multi-gpu / --skip-multi-gpu / --skip-slow / --only-slow)
Runner->>Runner: Build MARKER_EXPR from flags (multi_gpu, slow)
alt MARKER_EXPR non-empty
Runner->>PyTest: pytest -m "<MARKER_EXPR>" [options]
else
Runner->>PyTest: pytest [options]
end
PyTest-->>Job: Return results and reports
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
🧰 Additional context used🪛 actionlint (1.7.8).github/workflows/unit-tests-framework.yml257-257: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file (runner-label) 311-311: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file (runner-label) 🪛 GitHub Check: CodeQL.github/workflows/unit-tests-framework.yml[warning] 254-278: Workflow does not contain permissions ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
/ok to test 68fa036
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 62f381d
❌ 8 Tests Failed:
To view more test analytics, go to the Test Analytics Dashboard
From standup: Is it possible to start having these run nightly, and track queue time + success rate?
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 250bd29
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 68307b1
...onemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (Fixed)
...bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (Fixed)
...s/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (Fixed)
.../bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (Fixed)
.../bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (Fixed)
...es/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (Fixed)
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 6fbe6f4
- Merged in latest changes from origin/main
- Added @pytest.mark.multi_gpu to test_distributed_fp8.py::test_multi_process_fp8_recipes_are_synced
- Added @pytest.mark.multi_gpu to test_train.py::test_distributed_training_gradient_equivalence
- These new multi-GPU tests will run on the 2-GPU runner in merge queue/schedule
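For readers unfamiliar with the marker mechanism, a minimal sketch of what such a marked test can look like; the test name and body here are illustrative placeholders, not the actual tests named above:

```python
import pytest
import torch


# The custom marker is what the CI jobs and pytest_runner.sh flags select on
# (e.g. `pytest -m multi_gpu` or `--only-multi-gpu`), while the skipif guard
# keeps the test from running at all on machines with fewer than two GPUs.
@pytest.mark.multi_gpu
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="requires at least 2 GPUs")
def test_two_gpu_example():  # hypothetical test, for illustration only
    tensor = torch.ones(1, device="cuda:0")
    assert tensor.item() == 1.0
```

The marker itself has to be registered under `[tool.pytest.ini_options] markers` (as the pyproject.toml changes in this PR do) so pytest does not emit unknown-marker warnings.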
…g in evo2 preprocessing that ignored random seed during bootstrap
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test b777145
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test c3f9737
Actionable comments posted: 1
♻️ Duplicate comments (3)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
88-92: The security bot's concern is incorrect. The socket is bound to "localhost" (line 89), not to all interfaces. This is secure and appropriate for local testing.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
82-86: The security bot's concern is incorrect. The socket is bound to "localhost" (line 83), not to all interfaces. This is secure and appropriate for local testing.
.github/workflows/unit-tests-framework.yml (1)
254-294: Address the missing permissions block. Static analysis tools flagged that the run-tests-multi-gpu job does not limit the permissions of the GITHUB_TOKEN. Consider adding an explicit permissions block as a security best practice. Apply this diff to set minimal permissions:

   run-tests-multi-gpu:
+    permissions:
+      contents: read
     needs:
       - build-bionemo-image
       - get-pr-labels

This same recommendation applies to the run-tests-slow-multi-gpu job at lines 325-357.
🧹 Nitpick comments (4)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
83-99: Pre-spawn environment setup looks good. The environment variable setup before spawning ensures all processes use consistent values. The port allocation via temporary socket binding is a standard pattern in distributed testing.
Note: There's a small race condition window between closing the socket (line 91) and the test using the port, but this is acceptable in practice given the very short time window.
The A6000 workaround is well-documented and addresses a known hanging issue.
Consider extracting this environment setup pattern into a shared helper function, as it's duplicated across multiple test files (e.g., test_discrete_flow_matching_parallel.py).
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
77-93: Environment setup is identical to test_d3pm_parallel.py. This code is duplicated from the other test file. Both files would benefit from extracting this setup logic into a shared helper function (e.g., setup_distributed_test_environment() in a test utilities module). Apply a refactor to extract the common logic.
Example helper in a shared test utilities module:

def setup_distributed_test_environment():
    """Set up environment variables for distributed testing."""
    if "MASTER_ADDR" not in os.environ:
        os.environ["MASTER_ADDR"] = "localhost"
    if "MASTER_PORT" not in os.environ:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("localhost", 0))
        port = s.getsockname()[1]
        s.close()
        os.environ["MASTER_PORT"] = str(port)

    # Fix hanging issue on A6000 GPUs
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            if "A6000" in torch.cuda.get_device_name(i):
                os.environ["NCCL_P2P_DISABLE"] = "1"
                break

Then use it in both test files:

-    # Set up environment variables BEFORE spawning so all processes use the same values
-    if "MASTER_ADDR" not in os.environ:
-        os.environ["MASTER_ADDR"] = "localhost"
-    if "MASTER_PORT" not in os.environ:
-        # Find a free port for this test (bind to localhost only for security)
-        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
-        s.bind(("localhost", 0))
-        port = s.getsockname()[1]
-        s.close()
-        os.environ["MASTER_PORT"] = str(port)
-
-    # Fix hanging issue on A6000 GPUs
-    if torch.cuda.is_available():
-        for i in range(torch.cuda.device_count()):
-            if "A6000" in torch.cuda.get_device_name(i):
-                os.environ["NCCL_P2P_DISABLE"] = "1"
-                break
+    setup_distributed_test_environment()

sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (1)
79-88: Environment setup correctly uses localhost and addresses security concerns. The binding to "localhost" prevents external network access, addressing the security bot's concern about binding to all interfaces.
The dynamic port allocation has a theoretical TOCTOU race (the port could be taken between s.close() and spawn()), but this is acceptable in practice for CI tests where port collisions are rare.
Optional improvement: If flakiness occurs, consider keeping the socket open and using SO_REUSEADDR, or implementing retry logic with a port range.
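One possible shape for that optional improvement, sketched with an invented helper name rather than code from the PR:

```python
import socket


def find_free_localhost_port(max_tries: int = 5) -> int:
    """Pick an ephemeral port on localhost, retrying on the rare bind failure."""
    last_error = None
    for _ in range(max_tries):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                # SO_REUSEADDR narrows (but does not eliminate) the window in which
                # another process can grab the port between this probe and the real bind.
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                sock.bind(("localhost", 0))
                return sock.getsockname()[1]
        except OSError as err:
            last_error = err
    raise RuntimeError(f"could not find a free port: {last_error}")
```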
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (1)
497-598: Wrap download subprocess calls in try/except to skip on failure.
Surround each subprocess.run for wget, zcat, and cat with exception handling (CalledProcessError, TimeoutExpired) and call pytest.skip(...) on errors to prevent CI hangs and flakiness.
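A rough sketch of that pattern, assuming a helper local to the test module; the helper name, paths, and timeout value are placeholders:

```python
import subprocess

import pytest


def fetch_chromosome(test_dir, name="chr20"):
    """Download and decompress one reference file, skipping the test if that fails."""
    url = f"https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/{name}.fa.gz"
    try:
        subprocess.run(["wget", url], cwd=test_dir, check=True, timeout=600)
        with open(test_dir / f"{name}.fa", "w") as out:
            subprocess.run(["zcat", f"{name}.fa.gz"], stdout=out, cwd=test_dir, check=True, timeout=600)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
        # Treat missing network or tooling as an environment problem, not a test failure.
        pytest.skip(f"could not prepare {name} test data: {err}")
```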
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
.github/labels.yml (1 hunks)
.github/workflows/unit-tests-framework.yml (6 hunks)
.github/workflows/unit-tests-recipes.yml (2 hunks)
bionemo-recipes/models/amplify/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (4 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (4 hunks)
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (8 hunks)
ci/scripts/pytest_runner.sh (4 hunks)
docs/docs/main/contributing/contributing.md (1 hunks)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (3 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (4 hunks)
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
Evo2Preprocessor (47-451), preprocess_offline (387-451)
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/config.py (1)
Evo2PreprocessingConfig (46-101)
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
329-329: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
.github/workflows/unit-tests-recipes.yml
132-132: label "linux-amd64-gpu-l4-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
175-175: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
200-200: shellcheck reported issue in this script: SC1007:warning:2:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
200-200: shellcheck reported issue in this script: SC1007:warning:5:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-296: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
.github/workflows/unit-tests-recipes.yml
[warning] 174-216: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: run-tests-slow-multi-gpu
- GitHub Check: run-tests-single-gpu
- GitHub Check: run-tests-notebooks
- GitHub Check: run-tests-slow-single-gpu
- GitHub Check: run-tests-multi-gpu
- GitHub Check: unit-tests-single-gpu (models/amplify)
- GitHub Check: unit-tests-single-gpu (recipes/esm2_native_te)
🔇 Additional comments (32)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
63-69: LGTM! Clear test parameterization. The expanded parameterization with explicit IDs and the multi_gpu mark properly distinguishes single- and multi-GPU test cases.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
57-63: LGTM! Consistent test parameterization. The parameterization matches the pattern in test_d3pm_parallel.py, providing clear test case identification.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (4)
16-17: LGTM: Imports for environment setup. These standard library imports support the new multi-GPU test infrastructure added below.
59-65: LGTM: Multi-GPU test marking. The parametrization correctly adds explicit IDs and marks the 2-device case with pytest.mark.multi_gpu, enabling selective execution via CI labels as described in the PR objectives.
79-88: LGTM: Secure localhost binding for distributed test coordination. The code correctly binds to "localhost" (not "" or "0.0.0.0"), so the past security advisory about binding to all network interfaces does not apply here. The temporary socket pattern for finding a free port is standard practice for distributed testing, though there is a minor TOCTOU race between closing the socket and spawning processes, which is acceptable in test isolation.
90-96: LGTM: A6000 hang mitigation. The workaround correctly addresses the known NCCL P2P hang issue on A6000 GPUs mentioned in the PR description. The environment variable affects the entire process, which is acceptable for test infrastructure where tests typically run in isolation.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (4)
16-17: LGTM! Standard library imports appropriately added for environment setup and port discovery needed by the distributed test infrastructure.
59-65: LGTM! Excellent refactor to explicit parameterization with clear IDs and the multi_gpu marker. This aligns perfectly with the PR's objective to enable selective multi-GPU test execution in CI.
79-88: Pre-spawn environment setup looks good. Setting MASTER_ADDR and MASTER_PORT before spawning ensures all processes share the same distributed configuration. The port discovery correctly binds to localhost only (not all interfaces), which addresses security concerns.
Note: There's a theoretical TOCTOU race where another process could claim the port between s.close() and actual use, but this is acceptable in the test environment where pytest provides isolation and the risk is negligible.
90-95: Good hardware-specific workaround for the A6000 hanging issue. This correctly addresses the MOCO multi-GPU test hangs mentioned in the PR objectives by disabling NCCL P2P communication on A6000 devices. While this is a workaround rather than a root cause fix, it's appropriate for ensuring CI stability.
The simple string match on device name and setting the environment variable globally when any A6000 is detected is a reasonable approach.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (3)
17-18: LGTM: Necessary imports for environment setup. The os and socket imports support the dynamic port allocation and environment configuration added below.
59-65: LGTM: Clear parametrization with appropriate markers. The explicit test IDs and pytest.mark.multi_gpu marker align with the PR's multi-GPU CI strategy.
90-95: LGTM: Conservative workaround for A6000 hanging issues. Disabling NCCL peer-to-peer globally when any A6000 is detected is a pragmatic solution to the hanging problem described in the PR. This conservative approach prioritizes test reliability over potential performance gains from P2P communication.
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1)
78-78: LGTM! The multi_gpu marker correctly identifies this test for multi-GPU execution, aligning with the test's existing GPU count requirement and functionality.
bionemo-recipes/models/amplify/pyproject.toml (1)
34-38: LGTM! Pytest marker definitions are properly structured and align with the PR's multi-GPU test framework.
bionemo-recipes/models/esm2/pyproject.toml (1)
33-36: LGTM! Marker definitions correctly support the new multi-GPU test categorization.
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (1)
137-137: LGTM! All multi_gpu markers correctly identify tests that spawn multi-process distributed training with 2 GPUs, enabling proper test routing in CI.
Also applies to: 312-312, 565-565, 683-683, 803-803, 965-965, 1083-1083, 1203-1203
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
348-348: Good fix for deterministic split assignment. Changing from a set to a list enables deterministic ordering via random.sample() later in the flow, which respects the configured seed.
367-377: Excellent fix for the seed reproducibility bug. The combination of random.sample() with the seeded context and pop(0) ensures that split assignments are deterministic and reproducible with the same seed, addressing the bug mentioned in the PR description. The previous set.pop() approach was non-deterministic because Python sets have arbitrary iteration order.
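To make the determinism argument concrete, a small self-contained sketch (not the actual preprocess.py code):

```python
import random

SPLITS = ["train", "val", "test"]


def ordered_splits(seed: int) -> list[str]:
    """Same seed in, same ordering out, so split assignment is reproducible."""
    return random.Random(seed).sample(SPLITS, len(SPLITS))


assert ordered_splits(42) == ordered_splits(42)
# By contrast, set.pop() draws from a hash-dependent order, so
# {"train", "val", "test"}.pop() can differ between interpreter runs.
```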
.github/labels.yml (1)
22-24: LGTM! The new ciflow:multi-gpu label properly integrates with the CI framework for multi-GPU test execution.
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1)
69-69: LGTM! The multi_gpu marker appropriately categorizes this 2-GPU test for CI routing.
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1)
64-64: LGTM! The multi_gpu marker correctly identifies this multi-process FP8 test for proper CI execution.
docs/docs/main/contributing/contributing.md (1)
132-137: LGTM! The documentation for the ciflow:multi-gpu label is clear and consistent with other label descriptions. It properly explains the label's purpose, usage, and default behavior.
bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (1)
172-172: LGTM! The @pytest.mark.multi_gpu markers correctly identify tests that require multiple GPUs. All marked tests use torchrun with --nproc_per_node=2 and already have the @requires_multi_gpu decorator, making them suitable for multi-GPU test selection.
Also applies to: 420-420, 630-630, 912-912
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (1)
83-90: LGTM with observation! The @pytest.mark.multi_gpu markers correctly identify test parameters that require multiple GPUs (ddp>1, cp>1, pp>1, tp>1).
Note: The wi parameter was changed from "batch" to "epoch" for one of the ddp=2 cases (line 84 compared to lines 242-244). This appears intentional but is unrelated to the multi-GPU marker addition.
Also applies to: 241-267
ci/scripts/pytest_runner.sh (1)
105-119: Verify handling of conflicting flag combinations. The marker expression logic correctly combines slow and multi_gpu filters. However, if users provide conflicting flags (e.g., both --skip-slow and --only-slow, or both --skip-multi-gpu and --only-multi-gpu), the resulting marker expression would be contradictory (e.g., "not slow and slow").
Consider adding validation to detect and reject conflicting flag combinations:

+# Validate flag combinations
+if [[ "$SKIP_SLOW" == true && "$ONLY_SLOW" == true ]]; then
+  echo "Error: Cannot use both --skip-slow and --only-slow" >&2
+  exit 1
+fi
+if [[ "$SKIP_MULTI_GPU" == true && "$ONLY_MULTI_GPU" == true ]]; then
+  echo "Error: Cannot use both --skip-multi-gpu and --only-multi-gpu" >&2
+  exit 1
+fi
+
 # Build marker expression for filtering tests
 MARKER_EXPR=""

bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (1)
49-49: LGTM! The @pytest.mark.multi_gpu markers correctly identify tests that require multiple GPUs. All marked tests use torchrun with --nproc_per_node=2 and have the @requires_multi_gpu decorator, making them suitable for multi-GPU test selection.
Also applies to: 68-68, 87-87, 106-106
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
621-625: LGTM! The updated dataset_config fixture correctly falls back to downloading and preprocessing data when not provided via command line or environment variables. This enables the gradient equivalence test to run in CI environments.
689-689: LGTM! The @pytest.mark.multi_gpu marker correctly identifies this test as requiring multiple GPUs. The test uses 2 GPUs for various distributed parallelism strategies (dp/cp/tp/pp).
.github/workflows/unit-tests-recipes.yml (3)
130-172: LGTM! The unit-tests-single-gpu job correctly filters to single-GPU tests using pytest -v -m "not multi_gpu" . The job configuration is appropriate for single-GPU test execution.
173-215: LGTM! The unit-tests-multi-gpu job correctly filters to multi-GPU tests using pytest -v -m "multi_gpu" . The || exit 0 allows recipes without multi-GPU tests to pass gracefully.
219-233: LGTM! The verify-recipe-tests job correctly aggregates results from both unit-tests-single-gpu and unit-tests-multi-gpu jobs using the contains(needs.*.result, ...) pattern.
    # Set up environment variables BEFORE spawning so all processes use the same values
    if "MASTER_ADDR" not in os.environ:
        os.environ["MASTER_ADDR"] = "localhost"
    if "MASTER_PORT" not in os.environ:
        # Find a free port for this test (bind to localhost only for security)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("localhost", 0))
        port = s.getsockname()[1]
        s.close()
        os.environ["MASTER_PORT"] = str(port)

    # Fix hanging issue on A6000 GPUs
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            if "A6000" in torch.cuda.get_device_name(i):
                os.environ["NCCL_P2P_DISABLE"] = "1"
                break
Scope the MASTER/NCCL env tweaks to this test only
Setting MASTER_ADDR, MASTER_PORT, and NCCL_P2P_DISABLE via os.environ[...] = ... leaks those values to every later test in the same Python process. That cross-test contamination is already biting us: once MASTER_PORT is filled here, later spawns reuse the same port instead of grabbing a fresh free one, and forcing NCCL_P2P_DISABLE=1 means every subsequent distributed test runs without GPU P2P even when it should be enabled. Please switch to pytest’s monkeypatch (or another scoped helper) so the overrides are applied for this test only, e.g.:
- if "MASTER_ADDR" not in os.environ:
- os.environ["MASTER_ADDR"] = "localhost"
+ monkeypatch.setenv("MASTER_ADDR", os.environ.get("MASTER_ADDR", "localhost"))
...
- os.environ["MASTER_PORT"] = str(port)
+ monkeypatch.setenv("MASTER_PORT", str(port))
...
- os.environ["NCCL_P2P_DISABLE"] = "1"
+ monkeypatch.setenv("NCCL_P2P_DISABLE", "1")and add monkeypatch to the test signature. That keeps the CI-friendly defaults without mutating global state that other tests rely on.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py
around lines 79-95, the test currently mutates os.environ directly which leaks
MASTER_ADDR, MASTER_PORT and NCCL_P2P_DISABLE to other tests; change the test to
accept the pytest monkeypatch fixture and replace direct os.environ writes with
monkeypatch.setenv calls (compute a free localhost port as currently done, then
monkeypatch.setenv("MASTER_PORT", str(port))), use
monkeypatch.setenv("MASTER_ADDR","localhost") if not present, and for the A6000
NCCL tweak only call monkeypatch.setenv("NCCL_P2P_DISABLE","1") when a local
CUDA device name contains "A6000" so the environment change is scoped to this
test and does not affect subsequent tests.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (2)
112-114: Fix skip message text. The condition is world_size > device_count, but the message says "less than". Correct the message.

-        pytest.skip(f"World size {world_size} is less than the number of GPUs {torch.cuda.device_count()}")
+        pytest.skip(f"World size {world_size} exceeds available GPUs ({torch.cuda.device_count()})")

154-154: Wrong command name in assertion message. This test runs predict_evo2, not train_evo2.

-    assert result.returncode == 0, "train_evo2 command failed."
+    assert result.returncode == 0, "predict_evo2 command failed."

sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (1)
394-397: Bug: partial file‑existence check due to zip truncation. Only (BIN, train) and (IDX, val) are checked; the test split and other combinations are missed. This can cause accidental overwrites or skips.

-        if any(
-            self._get_output_filename(preproc_config, ext, split).is_file()
-            for ext, split in zip([self.BIN, self.IDX], [self.TRAIN, self.VAL, self.TEST])
-        ):
+        if any(
+            self._get_output_filename(preproc_config, ext, split).is_file()
+            for ext in (self.BIN, self.IDX)
+            for split in (self.TRAIN, self.VAL, self.TEST)
+        ):

.github/workflows/unit-tests-recipes.yml (1)
1-20: Add least‑privilege token permissions. Declare minimal permissions for GITHUB_TOKEN.

 name: "BioNeMo Recipes CI"
@@
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true
+
+permissions:
+  contents: read
♻️ Duplicate comments (11)
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (7)
312-315: LGTM; same env-port suggestion applies. Marker addition is correct. Please apply the per-test MASTER_ADDR/PORT env setup as suggested above to avoid flakiness.
565-568: LGTM; same env-port suggestion applies. Good to gate as multi_gpu. Mirror the MASTER_ADDR/PORT and optional A6000 workaround in this test's env before subprocess.run.
683-686: LGTM; same env-port suggestion applies. Use per-test MASTER_ADDR/PORT in the env passed to subprocess to avoid collisions.
803-806: LGTM; same env-port suggestion applies. Consistent with others; add unique MASTER_ADDR/PORT in env.
965-968: LGTM; same env-port suggestion applies. Please add the per-test MASTER_ADDR/PORT env setup in this DDP multi-proc test too.
1083-1086: LGTM; same env-port suggestion applies. Marking looks good; add unique MASTER_ADDR/PORT in env.
1203-1206: LGTM; same env-port suggestion applies. Mirror env setup (IPv4 loopback + free port) as noted in earlier comment.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (1)
79-96: Same env hardening as other parallel tests. Apply IPv4 loopback, always-pick-port, and scope NCCL tweak to multi-GPU as suggested in the MDLM test.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (1)
79-96: Repeat env robustness tweaks (IPv4, fresh port, scope NCCL). Mirror the small refactor used in sibling tests to reduce flakiness and optimize the single-GPU path.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
77-94: Apply same env hardening (IPv4, always-pick-port, scope NCCL) and consider deduping. Consistent tweaks reduce cross-test collisions and unnecessary NCCL changes on single-GPU.
.github/workflows/unit-tests-framework.yml (1)
37-57: Add least‑privilege token permissions. Set minimal GITHUB_TOKEN permissions at workflow level.

 name: "BioNeMo Framework CI"
 on:
@@
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true
+
+permissions:
+  contents: read
🧹 Nitpick comments (11)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
94-99: A6000 stability workaround is appropriate. Setting NCCL_P2P_DISABLE=1 for A6000 GPUs addresses the hanging issues mentioned in the PR description. The implementation correctly checks all visible devices and applies the workaround when needed.
Consider centralizing this A6000 workaround if it's used across multiple test files. A shared test utility or conftest.py fixture could reduce duplication and make the workaround easier to maintain or remove in future NCCL versions.
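If that centralization were done, one possible shape is an autouse fixture in a shared conftest.py; this is only a sketch, and the fixture name is invented:

```python
import pytest
import torch


@pytest.fixture(autouse=True)
def disable_nccl_p2p_on_a6000(monkeypatch):
    """Apply the A6000 NCCL P2P workaround per test, without leaking into other tests."""
    if torch.cuda.is_available() and any(
        "A6000" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ):
        monkeypatch.setenv("NCCL_P2P_DISABLE", "1")
```

Using monkeypatch here would also address the earlier review note about environment variables leaking between tests.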
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (1)
79-96: Harden env setup: prefer IPv4, always set a fresh port, scope NCCL tweak to multi-GPU, and dedupe.
- Use 127.0.0.1 to avoid IPv6/NCCL issues.
- Always assign a free port to reduce cross-test collisions.
- Only set NCCL_P2P_DISABLE when world_size >= 2.
- Consider factoring this into a small helper to reuse across tests.
Diff:
- if "MASTER_ADDR" not in os.environ: - os.environ["MASTER_ADDR"] = "localhost" - if "MASTER_PORT" not in os.environ: - # Find a free port for this test (bind to localhost only for security) - s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - s.bind(("localhost", 0)) - port = s.getsockname()[1] - s.close() - os.environ["MASTER_PORT"] = str(port) + # Always prefer IPv4 loopback and pick a free ephemeral port + os.environ["MASTER_ADDR"] = "127.0.0.1" + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + s.bind(("127.0.0.1", 0)) + os.environ["MASTER_PORT"] = str(s.getsockname()[1]) - # Fix hanging issue on A6000 GPUs - if torch.cuda.is_available(): + # Fix hanging issue on A6000 GPUs (only matters for multi-GPU) + if torch.cuda.is_available() and world_size >= 2: for i in range(torch.cuda.device_count()): if "A6000" in torch.cuda.get_device_name(i): os.environ["NCCL_P2P_DISABLE"] = "1" breaksub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (1)
137-146: Subprocess can hang; add timeout and raise on failure. Guard long‑running distributed calls to avoid indefinite hangs.

-    result = subprocess.run(
+    try:
+        result = subprocess.run(
             command,
             shell=True,  # Use the shell to interpret wildcards (e.g. SDH*)
             cwd=tmp_path,  # Run in the temporary directory
             capture_output=True,  # Capture stdout and stderr for debugging
             env=env,  # Pass in the env where we override the master port.
-        text=True,  # Decode output as text
-    )
+            text=True,  # Decode output as text
+            timeout=480,
+        )
+    except subprocess.TimeoutExpired as e:
+        sys.stderr.write(f"Timed out running predict_evo2: {e}\n")
+        raise

sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
348-349: Deterministic split assignment: good fix. Replacing set.pop() with a seeded random ordering removes nondeterminism. Minor: pop(0) is O(n); for 3 items it's trivial, but a deque would be slightly cleaner.

-    splits_needed = random.sample(splits_list, len(splits_list)) if splits_list else []
+    from collections import deque
+    splits_needed = deque(random.sample(splits_list, len(splits_list))) if splits_list else deque()
...
-    split = splits_needed.pop(0)
+    split = splits_needed.popleft()

Also applies to: 366-370, 375-378

249-252: Nondeterministic seed component from Python hash(). hash(filepath) is salted per process (PYTHONHASHSEED), breaking cross‑run determinism. Prefer a stable hash of the path string.

-    with self.preprocessing_context_manager(
-        config.seed + hash(filepath) + seq_idx if config.seed is not None else None
-    ):
+    stable = int.from_bytes(__import__("hashlib").md5(str(filepath).encode()).digest()[:8], "big")
+    with self.preprocessing_context_manager(
+        (config.seed + stable + seq_idx) if config.seed is not None else None
+    ):

ci/scripts/pytest_runner.sh (1)
105-120: Marker filtering logic LGTM; add conflict guard. If both --skip-multi-gpu and --only-multi-gpu are set, fail fast to avoid surprising selection.

 MARKER_EXPR=""
+# Guard conflicting flags early
+if [[ "$SKIP_MULTI_GPU" == true && "$ONLY_MULTI_GPU" == true ]]; then
+  echo "Use either --skip-multi-gpu or --only-multi-gpu, not both." >&2
+  exit 2
+fi

.github/workflows/unit-tests-framework.yml (2)
254-283: Multi‑GPU job config LGTM; add a timeout. Add timeout-minutes to prevent indefinite hangs on distributed runs.

   run-tests-multi-gpu:
@@
-    if: |
+    timeout-minutes: 120
+    if: |

350-358: Slow multi‑GPU job: add timeout. Same recommendation for the slow suite.

   run-tests-slow-multi-gpu:
@@
-    if: |
+    timeout-minutes: 180
+    if: |

bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (1)
172-174: Multi‑GPU gating LGTM. multi_gpu markers align with runner gating and existing skipifs.
Add timeout=... to subprocess.run calls in these multi‑GPU tests to prevent indefinite hangs on CI infra.
Also applies to: 420-422, 630-632, 912-914
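As a sketch of that suggestion (the script name and the limit are placeholders, not values from these tests):

```python
import subprocess

# Bounding the torchrun call means a hung NCCL collective fails the test
# with TimeoutExpired instead of stalling the CI job indefinitely.
result = subprocess.run(
    ["torchrun", "--nproc_per_node=2", "train_example.py"],  # placeholder script
    capture_output=True,
    text=True,
    timeout=1800,  # seconds; choose a value well above the normal runtime
)
assert result.returncode == 0, result.stderr
```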
.github/workflows/unit-tests-recipes.yml (1)
156-167: Shellcheck: clarify empty PIP_CONSTRAINT. Use an explicit empty string to silence SC1007.

-          PIP_CONSTRAINT= pip install -e .
+          PIP_CONSTRAINT="" pip install -e .
@@
-          PIP_CONSTRAINT= pip install -r requirements.txt
+          PIP_CONSTRAINT="" pip install -r requirements.txt

Also applies to: 198-211
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (1)
519-533: External downloads: add timeouts and basic retry. Network fetches can hang. Add a timeout and a simple retry loop.

-    subprocess.run(
-        ["wget", "https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/chr20.fa.gz"],
-        cwd=test_dir,
-        check=True,
-    )
+    for f in chr20 chr21 chr22; do
+      for i in {1..3}; do
+        if wget -T 120 -O "$f.fa.gz" "https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/$f.fa.gz"; then break; fi
+        echo "Retry $i for $f..."
+        sleep 2
+      done
+    done
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
.github/labels.yml (1 hunks)
.github/pull_request_template.md (1 hunks)
.github/workflows/unit-tests-framework.yml (6 hunks)
.github/workflows/unit-tests-recipes.yml (2 hunks)
bionemo-recipes/models/amplify/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (4 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (4 hunks)
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (8 hunks)
ci/scripts/pytest_runner.sh (4 hunks)
docs/docs/main/contributing/contributing.md (2 hunks)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (3 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (4 hunks)
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
Evo2Preprocessor (47-451), preprocess_offline (387-451)
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/config.py (1)
Evo2PreprocessingConfig (46-101)
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
329-329: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
.github/workflows/unit-tests-recipes.yml
132-132: label "linux-amd64-gpu-l4-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
175-175: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
200-200: shellcheck reported issue in this script: SC1007:warning:2:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
200-200: shellcheck reported issue in this script: SC1007:warning:5:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-296: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
.github/workflows/unit-tests-recipes.yml
[warning] 174-216: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: Analyze (rust)
🔇 Additional comments (33)
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1)
78-78: LGTM! Marker correctly categorizes multi-GPU test. The @pytest.mark.multi_gpu marker appropriately tags this test for selective execution in CI. It complements the existing skipif decorator: the skipif prevents execution on insufficient hardware, while the marker enables test selection via CLI flags (--skip-multi-gpu, --only-multi-gpu) and the ciflow:multi-gpu label.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (2)
63-69: LGTM! Multi-GPU test marking is correctly applied. The parametrization with named IDs and pytest.mark.multi_gpu for world_size=2 aligns with the PR objectives to mark multi-GPU tests for conditional CI execution.
83-92: LGTM! Environment setup follows best practices.The pre-spawn environment configuration ensures all processes share the same distributed training settings. Binding to
"localhost"(not all interfaces) is secure, addressing the security concern from the previous review.Note: There's a minor TOCTOU race where another process could claim the port between closing the socket and spawning, but this is negligible in practice and is a known limitation of this common pattern.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (4)
16-17: LGTM!Standard library imports required for the new environment setup and port-finding functionality.
59-65: LGTM!Good parametrization pattern that enables testing both single-GPU and multi-GPU scenarios, with the multi-GPU case properly marked for selective execution in CI.
79-88: Good security practice: binding to localhost.The environment setup correctly uses
"localhost"instead of binding to all interfaces (''or'0.0.0.0'), which addresses the security concern from the previous review. The MASTER_ADDR/MASTER_PORT configuration before spawning ensures all processes share consistent values.
90-95: Necessary workaround for A6000 hanging issue.This fix addresses the hanging multi-GPU tests mentioned in the PR description. Setting
NCCL_P2P_DISABLE=1for A6000 GPUs is a standard workaround for known peer-to-peer communication issues on these devices.sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (2)
17-18: LGTM: required imports added.Imports for os/socket are appropriate for pre-spawn env setup.
59-65: Nice parametrize with multi_gpu gating.Clear ids and selective marking improve CI selection.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (2)
16-17: LGTM: imports for env prep.
59-65: Parametrize looks good with multi_gpu mark.sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (2)
16-17: LGTM: imports for env prep.
59-65: Parametrize with ids and multi_gpu mark is clear.sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (2)
16-17: LGTM: imports for env prep.
57-63: Parametrize: good use of ids and multi_gpu mark.bionemo-recipes/models/amplify/pyproject.toml (1)
34-38: LGTM!The pytest markers are well-defined with clear descriptions. The expansion from a single marker to a list is properly formatted and aligns with the PR's multi-GPU testing infrastructure.
bionemo-recipes/models/esm2/pyproject.toml (1)
33-36: LGTM!The pytest markers are properly defined and consistent with the multi-GPU testing framework introduced in this PR.
bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (1)
49-49: LGTM! The @pytest.mark.multi_gpu decorators are correctly applied to all multi-GPU tests in this file, enabling proper test filtering in CI. The decorator placement before @requires_multi_gpu is appropriate.
Also applies to: 68-68, 87-87, 106-106
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1)
69-69: LGTM! The @pytest.mark.multi_gpu decorator is correctly placed and enables proper test filtering for this multi-GPU test.
docs/docs/main/contributing/contributing.md (2)
132-138: LGTM! Documentation is comprehensive. The ciflow:multi-gpu label documentation clearly explains:
- When to use it (distributed/multi-GPU code changes)
- How it combines with ciflow:slow for additional coverage
- Default behavior (disabled in PR CI, enabled in nightly/merge queue)
147-147: LGTM! The update to the ciflow:all description correctly reflects that it now includes multi-GPU tests in the comprehensive test suite.
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1)
64-64: LGTM! The @pytest.mark.multi_gpu decorator is correctly applied, enabling proper test filtering for this FP8 multi-GPU test.
.github/labels.yml (1)
22-24: LGTM! The ciflow:multi-gpu label is properly defined with a clear description and distinctive color. The label naming follows the established ciflow:* convention.
.github/pull_request_template.md (2)
30-31: LGTM! The PR template updates clearly document:
- The new ciflow:multi-gpu label and its purpose
- How to combine it with ciflow:slow for comprehensive multi-GPU testing
- The expanded scope of ciflow:all to include multi-GPU tests
34-34: LGTM! The default behavior is clearly stated, helping contributors understand that multi-GPU tests require explicit enablement via the ciflow:multi-gpu label.
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (3)
83-91: Multi‑GPU parametrizations LGTM. Marker usage and IDs look consistent. Ensure multi_gpu is registered in pytest.ini/pyproject to avoid unknown‑marker warnings.
121-124: A6000 hang workaround: consider broadening NCCL guard. Setting NCCL_P2P_DISABLE=1 is fine. If hangs persist, also consider NCCL_SHM_DISABLE=1 and NCCL_IB_DISABLE=1 as a fallback on these runners.
241-268: Equivalence test markers LGTM. Multi‑GPU gating and parameter coverage look good.
Ensure the multi_gpu marker is included in pyproject.toml (add to markers) to silence warnings. Based on learnings.
.github/workflows/unit-tests-framework.yml (1)
258-271: Runner label warnings in actionlint. These are custom self‑hosted labels. Configure actionlint to know them (actionlint.yaml) or ignore the warnings in CI.
Also applies to: 329-345
.github/workflows/unit-tests-recipes.yml (2)
132-136: Runner label warnings in actionlint. Custom labels are fine; register them in actionlint.yaml to avoid lint noise.
Also applies to: 175-179
219-233: Matrix status check LGTM. Combined status logic is correct.
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
621-625: Fixture fallback logic LGTM. Automatic data prep when paths are not provided is convenient.
689-693: Multi‑GPU gating LGTM. Marker aligns with the 2‑GPU requirement.
      - name: Run multi-GPU tests
        working-directory: ${{ matrix.recipe.dir }}
        run: pytest -v -m "multi_gpu" . || exit 0  # Allow no tests found for recipes without multi-GPU tests
Don’t swallow real test failures
pytest ... || exit 0 hides failures. Only ignore exit code 5 (no tests).
- - name: Run multi-GPU tests
+ - name: Run multi-GPU tests
working-directory: ${{ matrix.recipe.dir }}
- run: pytest -v -m "multi_gpu" . || exit 0 # Allow no tests found for recipes without multi-GPU tests
+ run: |
+ set +e
+ pytest -v -m "multi_gpu" .
+ rc=$?
+ if [[ $rc -eq 5 ]]; then
+ echo "No tests found for multi_gpu; continuing."
+ exit 0
+ fi
+ exit $rc📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
      - name: Run multi-GPU tests
        working-directory: ${{ matrix.recipe.dir }}
        run: |
          set +e
          pytest -v -m "multi_gpu" .
          rc=$?
          if [[ $rc -eq 5 ]]; then
            echo "No tests found for multi_gpu; continuing."
            exit 0
          fi
          exit $rc
🤖 Prompt for AI Agents
.github/workflows/unit-tests-recipes.yml around lines 212 to 215: the current
command swallows all failures by unconditionally returning 0; change it to run
pytest, capture its exit code, and only convert exit code 5 (no tests collected)
to 0 while preserving all other exit codes (i.e., run pytest -v -m "multi_gpu"
.; rc=$?; if [ $rc -eq 5 ]; then exit 0; else exit $rc; fi).
@pytest.mark.multi_gpu
@requires_multi_gpu
@pytest.mark.slow
def test_checkpoint_save_and_load_two_processes_mfsdp():
Add unique MASTER_ADDR/PORT to avoid torchrun port collisions (apply to this file’s multi-proc tests).
Great to mark as multi_gpu. However, torchrun defaults (e.g., MASTER_PORT=29500) can collide when multiple tests run concurrently. Set a per-test port in the env dict before subprocess.run, and prefer IPv4.
Example (place after env = os.environ.copy() in this test):
env = os.environ.copy()
env["WANDB_MODE"] = "disabled"
+import socket
+# Use IPv4 loopback and a free ephemeral port
+env.setdefault("MASTER_ADDR", "127.0.0.1")
+if "MASTER_PORT" not in env:
+ with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+ s.bind(("127.0.0.1", 0))
+ env["MASTER_PORT"] = str(s.getsockname()[1])
+# Optional: mitigate A6000 hangs in CI
+if torch.cuda.is_available() and any("A6000" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())):
+ env.setdefault("NCCL_P2P_DISABLE", "1")Repeat for the resume phase and mirror in other 2‑proc tests in this module.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@pytest.mark.multi_gpu
@requires_multi_gpu
@pytest.mark.slow
def test_checkpoint_save_and_load_two_processes_mfsdp():
    env = os.environ.copy()
    env["WANDB_MODE"] = "disabled"
    import socket
    # Use IPv4 loopback and a free ephemeral port
    env.setdefault("MASTER_ADDR", "127.0.0.1")
    if "MASTER_PORT" not in env:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))
            env["MASTER_PORT"] = str(s.getsockname()[1])
    # Optional: mitigate A6000 hangs in CI
    if torch.cuda.is_available() and any(
        "A6000" in torch.cuda.get_device_name(i)
        for i in range(torch.cuda.device_count())
    ):
        env.setdefault("NCCL_P2P_DISABLE", "1")
    # ... existing subprocess.run or torchrun invocation using env ...
🤖 Prompt for AI Agents
In
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py
around lines 137 to 140, the multi-process test does not set explicit
MASTER_ADDR/MASTER_PORT so torchrun can collide with other tests; after the
existing env = os.environ.copy() in this test set env["MASTER_ADDR"]="127.0.0.1"
and env["MASTER_PORT"]=str(<unique_port>) (pick a test-specific free port or
compute base+pid) before each subprocess.run call (both the initial run and the
resume phase), and apply the same change to the other two-process tests in this
file so each subprocess invocation uses IPv4 and a unique port.
    subprocess.run(["zcat", "chr20.fa.gz"], stdout=open(test_dir / "chr20.fa", "w"), cwd=test_dir, check=True)
    subprocess.run(["zcat", "chr21.fa.gz"], stdout=open(test_dir / "chr21.fa", "w"), cwd=test_dir, check=True)
    subprocess.run(["zcat", "chr22.fa.gz"], stdout=open(test_dir / "chr22.fa", "w"), cwd=test_dir, check=True)

    # Concatenate files
    subprocess.run(
        ["cat", "chr20.fa", "chr21.fa", "chr22.fa"], stdout=open(concat_path, "w"), cwd=test_dir, check=True
    )
Close file descriptors and add timeouts
zcat/cat calls open files without closing; add context managers and timeouts.
-    subprocess.run(["zcat", "chr20.fa.gz"], stdout=open(test_dir / "chr20.fa", "w"), cwd=test_dir, check=True)
-    subprocess.run(["zcat", "chr21.fa.gz"], stdout=open(test_dir / "chr21.fa", "w"), cwd=test_dir, check=True)
-    subprocess.run(["zcat", "chr22.fa.gz"], stdout=open(test_dir / "chr22.fa", "w"), cwd=test_dir, check=True)
+    with open(test_dir / "chr20.fa", "w") as out20:
+        subprocess.run(["zcat", "chr20.fa.gz"], stdout=out20, cwd=test_dir, check=True, timeout=600)
+    with open(test_dir / "chr21.fa", "w") as out21:
+        subprocess.run(["zcat", "chr21.fa.gz"], stdout=out21, cwd=test_dir, check=True, timeout=600)
+    with open(test_dir / "chr22.fa", "w") as out22:
+        subprocess.run(["zcat", "chr22.fa.gz"], stdout=out22, cwd=test_dir, check=True, timeout=600)
@@
-    subprocess.run(
-        ["cat", "chr20.fa", "chr21.fa", "chr22.fa"], stdout=open(concat_path, "w"), cwd=test_dir, check=True
-    )
+    with open(concat_path, "w") as outcat:
+        subprocess.run(["cat", "chr20.fa", "chr21.fa", "chr22.fa"], stdout=outcat, cwd=test_dir, check=True, timeout=600)
🤖 Prompt for AI Agents
In sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py around lines
536 to 543, subprocess.run calls open() for stdout without closing file
descriptors and lack timeouts; update each subprocess.run to open the target
file with a context manager (with open(path, "w") as f:) and pass the file
object as stdout, and add a reasonable timeout argument (e.g. timeout=30) to
each subprocess.run so files are always closed even on errors and the calls
cannot hang.
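Since the same open/run pattern appears for all three chromosomes, a tiny wrapper is another option — a sketch only; zcat_to is a hypothetical helper, not code from test_train.py:

import subprocess
from pathlib import Path


def zcat_to(src_name: str, dest: Path, cwd: Path, timeout: int = 600) -> None:
    """Decompress cwd/src_name into dest, closing the output file and bounding runtime."""
    with open(dest, "w") as out:
        subprocess.run(["zcat", src_name], stdout=out, cwd=cwd, check=True, timeout=timeout)

The three calls above would then collapse to zcat_to("chr20.fa.gz", test_dir / "chr20.fa", test_dir) and so on, with the cat step handled the same way.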
  unit-tests-multi-gpu:
    needs: changed-dirs
    runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
    if: ${{ needs.changed-dirs.outputs.dirs != '[]' }}
    name: "unit-tests-multi-gpu (${{ matrix.recipe.name }})"
we don't want these to be running on every PR right? Are we confident enough in those nemo A6000 nodes to hold up critical PRs? Are they aware of the incoming usage?
let's make this only run on schedule events, and only if the single-GPU tests pass
done
i don't see where this is gated by schedule?
Signed-off-by: Gagan Kaushik <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
.github/workflows/unit-tests-framework.yml (1)
37-49: Add minimal GITHUB_TOKEN permissions at workflow level. Silences CodeQL and follows least‑privilege.

 name: "BioNeMo Framework CI"
+
+permissions:
+  contents: read
+  pull-requests: read
+
 on:
🧹 Nitpick comments (4)
docs/docs/main/contributing/contributing.md (1)
144-145: Clarify multi‑GPU behavior and add a dedicated label section

Good note. Suggest making PR usage explicit and adding a ciflow:multi-gpu subsection for discoverability.

Apply this diff to clarify the note:

-  - Note: Multi-GPU tests always run separately in nightly builds and are not affected by this label
+  - Note: Multi-GPU tests run separately in nightly builds and are not affected by this label. To run them in PR CI, add the `ciflow:multi-gpu` label.

And add a dedicated section for the label (placed after ciflow:all):

+#### **ciflow:multi-gpu**
+
+- Triggers multi-GPU tests (2× RTX A6000 runners) on PR CI.
+- Disabled by default; they run automatically in the merge queue and nightly regardless of `ciflow:all`.
+- Use when modifying distributed or multi-GPU code paths.
+- Mark tests with `@pytest.mark.multi_gpu`.

Please verify this matches the actual workflow/job behavior introduced in this PR (job names, hardware type, and when they're scheduled).
.github/workflows/unit-tests-framework.yml (3)
258-258: Unknown runner label; verify self‑hosted configuration or add 'self‑hosted' to runs‑on.

actionlint flags linux-amd64-gpu-rtxa6000-latest-2-nemo. If these are self-hosted, prefer:

  runs-on: [self-hosted, linux, x64, your-custom-label]

or configure actionlint to allow custom labels.
Also applies to: 325-325
228-237: Remove redundant chmod of run_pytest_unittests.sh. You've switched to pytest_runner.sh; drop the unused chmod.

       run: |
-        chmod +x ./ci/scripts/run_pytest_unittests.sh
         chmod +x ./ci/scripts/pytest_runner.sh

Apply to single-GPU and multi-GPU jobs.
Also applies to: 272-279, 345-346
254-291: Add timeouts to prevent hung multi‑GPU runs. Past MOCO hangs suggest setting job timeouts.

   run-tests-multi-gpu:
+    timeout-minutes: 120
@@
   run-tests-slow-multi-gpu:
+    timeout-minutes: 180

Also applies to: 321-347
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .github/labels.yml (1 hunks)
- .github/pull_request_template.md (1 hunks)
- .github/workflows/unit-tests-framework.yml (6 hunks)
- docs/docs/main/contributing/contributing.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/pull_request_template.md
- .github/labels.yml
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
325-325: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-292: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (2)
.github/workflows/unit-tests-framework.yml (2)
280-290: Codecov correctly disabled for multi‑GPU jobs. Conditions exclude schedule/merge_group, so no uploads from multi‑GPU runs. Matches prior feedback.

314-320: Flags for slow single‑GPU look right. --only-slow with --skip-multi-gpu is correct; allow-no-tests avoids false failures.
  run-tests-multi-gpu:
    needs:
      - build-bionemo-image
      - run-tests-single-gpu
    runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
    container:
      image: svcbionemo023/bionemo-framework:${{ github.run_id }}
      credentials:
        username: ${{ vars.DOCKER_USERNAME }}
        password: ${{ secrets.DOCKER_PASSWORD }}
    if: |
      (needs.build-bionemo-image.result == 'success') &&
      (needs.run-tests-single-gpu.result == 'success') &&
      (github.event_name == 'schedule')
    steps:
Multi‑GPU job never runs on PR label or merge queue; schedule‑only contradicts PR objectives.
Enable ciflow:multi-gpu and merge_group triggers and wire in get-pr-labels.
Apply:
run-tests-multi-gpu:
- needs:
- - build-bionemo-image
- - run-tests-single-gpu
+ needs:
+ - build-bionemo-image
+ - run-tests-single-gpu
+ - get-pr-labels
@@
- if: |
- (needs.build-bionemo-image.result == 'success') &&
- (needs.run-tests-single-gpu.result == 'success') &&
- (github.event_name == 'schedule')
+ if: |
+ (needs.build-bionemo-image.result == 'success') &&
+ (needs.run-tests-single-gpu.result == 'success') &&
+ (
+ (github.event_name == 'schedule') ||
+ (github.event_name == 'merge_group') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+ )

🧰 Tools
🪛 actionlint (1.7.8)
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
  run-tests-slow-multi-gpu:
    needs:
      - build-bionemo-image
      - run-tests-slow-single-gpu
    runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
    container:
      image: svcbionemo023/bionemo-framework:${{ github.run_id }}
      credentials:
        username: ${{ vars.DOCKER_USERNAME }}
        password: ${{ secrets.DOCKER_PASSWORD }}
    if: |
      (needs.build-bionemo-image.result == 'success') &&
      (needs.run-tests-slow-single-gpu.result == 'success') &&
      (github.event_name == 'schedule')
    steps:
Slow multi‑GPU job also schedule‑only; allow PR label and merge queue.
Mirror fast job gating for consistency with usage docs.
run-tests-slow-multi-gpu:
- needs:
- - build-bionemo-image
- - run-tests-slow-single-gpu
+ needs:
+ - build-bionemo-image
+ - run-tests-slow-single-gpu
+ - get-pr-labels
@@
- if: |
- (needs.build-bionemo-image.result == 'success') &&
- (needs.run-tests-slow-single-gpu.result == 'success') &&
- (github.event_name == 'schedule')
+ if: |
+ (needs.build-bionemo-image.result == 'success') &&
+ (needs.run-tests-slow-single-gpu.result == 'success') &&
+ (
+ (github.event_name == 'schedule') ||
+ (github.event_name == 'merge_group') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+ )
🧰 Tools
🪛 actionlint (1.7.8)
325-325: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🤖 Prompt for AI Agents
.github/workflows/unit-tests-framework.yml lines 321-335: the
run-tests-slow-multi-gpu job is currently restricted to schedule-only by
checking github.event_name == 'schedule'; update its if expression to mirror the
fast job gating so it also runs when a PR is labeled or triggered via the merge
queue as well as on schedule. Replace the current if block with the same
conditional used by the fast job (allowing schedule OR pull_request with the
specific label OR merge-queue triggers) so behavior matches the fast job and
project docs.
Signed-off-by: Gagan Kaushik <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
.github/workflows/unit-tests-framework.yml (3)
37-57: Add least‑privilege GITHUB_TOKEN permissions (and job‑level PR read for labels). Set minimal top‑level permissions and grant pull‑requests: read to get-pr-labels.
Based on static analysis hints
 name: "BioNeMo Framework CI"
 on:
   push:
@@
   schedule:
     - cron: "0 7 * * *" # Runs at 7 AM UTC daily (12 AM MST)
+permissions:
+  contents: read
+
 defaults:
   run:
     shell: bash -x -e -u -o pipefail {0}

And for the labels job:

   get-pr-labels:
+    permissions:
+      contents: read
+      pull-requests: read
     runs-on: ubuntu-latest
254-268: Multi‑GPU job is schedule‑only; enable PR label and merge queue per usage docs. Wire in get-pr-labels and expand the if to allow ciflow:multi-gpu, ciflow:all, merge_group, and schedule.

   run-tests-multi-gpu:
-    needs:
-      - build-bionemo-image
-      - run-tests-single-gpu
+    needs:
+      - build-bionemo-image
+      - run-tests-single-gpu
+      - get-pr-labels
@@
-    if: |
-      (needs.build-bionemo-image.result == 'success') &&
-      (needs.run-tests-single-gpu.result == 'success') &&
-      (github.event_name == 'schedule')
+    if: |
+      (needs.build-bionemo-image.result == 'success') &&
+      (needs.run-tests-single-gpu.result == 'success') &&
+      (
+        (github.event_name == 'schedule') ||
+        (github.event_name == 'merge_group') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+      )
309-323: Slow multi‑GPU job is also schedule‑only; mirror fast job gating. Allow ciflow:multi-gpu/ciflow:all labels and merge queue too.

   run-tests-slow-multi-gpu:
-    needs:
-      - build-bionemo-image
-      - run-tests-slow-single-gpu
+    needs:
+      - build-bionemo-image
+      - run-tests-slow-single-gpu
+      - get-pr-labels
@@
-    if: |
-      (needs.build-bionemo-image.result == 'success') &&
-      (needs.run-tests-slow-single-gpu.result == 'success') &&
-      (github.event_name == 'schedule')
+    if: |
+      (needs.build-bionemo-image.result == 'success') &&
+      (needs.run-tests-slow-single-gpu.result == 'success') &&
+      (
+        (github.event_name == 'schedule') ||
+        (github.event_name == 'merge_group') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+      )
🧹 Nitpick comments (3)
.github/workflows/unit-tests-framework.yml (3)
228-237: Avoid chmod on unused wrapper or use it consistently. You chmod run_pytest_unittests.sh but don't use it. Either remove the chmod, or call the wrapper as earlier suggested by the reviewer.

Option A (remove unused chmod):

-          chmod +x ./ci/scripts/run_pytest_unittests.sh
           chmod +x ./ci/scripts/pytest_runner.sh
           ./ci/scripts/pytest_runner.sh --no-nbval --skip-slow --skip-multi-gpu

Option B (use wrapper and add skip flag, aligning with prior guidance):

-          chmod +x ./ci/scripts/pytest_runner.sh
-          ./ci/scripts/pytest_runner.sh --no-nbval --skip-slow --skip-multi-gpu
+          chmod +x ./ci/scripts/run_pytest_unittests.sh
+          ./ci/scripts/run_pytest_unittests.sh --skip-multi-gpu
22-27: Keep the workflow comments in sync with gating. After enabling label/merge‑queue triggers for multi‑GPU jobs, update these "nightly only" notes.
Also applies to: 31-35
367-392: Optional: track queue time and success rates. Emit queue/runtime metrics to the job summary via GH API.
Example step:
      - name: Queue/runtime metrics
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          RUN_JSON=$(gh api repos/${{ github.repository }}/actions/runs/${{ github.run_id }})
          CREATED=$(echo "$RUN_JSON" | jq -r '.created_at')
          STARTED=$(echo "$RUN_JSON" | jq -r '.run_started_at')
          NOW=$(date -u +%FT%TZ)
          QUEUE_SEC=$(python - <<PY
          from datetime import datetime, timezone
          from dateutil import parser
          c=parser.isoparse("$CREATED"); s=parser.isoparse("$STARTED")
          print(int((s-c).total_seconds()))
          PY
          )
          echo "Queue time (s): $QUEUE_SEC" >> $GITHUB_STEP_SUMMARY
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- .github/workflows/unit-tests-framework.yml (6 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
313-313: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-280: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
.github/workflows/unit-tests-framework.yml (1)
258-258: actionlint: declare custom runner labels to silence false positives. Add an actionlint config listing your self‑hosted labels.
Based on static analysis hints
Create .github/actionlint.yaml:
runner-label:
  - self-hosted
  - linux-amd64-cpu16
  - linux-amd64-gpu-l4-latest-1
  - linux-amd64-gpu-rtxa6000-latest-2-nemo

Also applies to: 313-313
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test aa64021
    if: |
      (needs.build-bionemo-image.result == 'success') &&
      (needs.run-tests-single-gpu.result == 'success') &&
      (github.event_name == 'schedule')
we should run either on the schedule event or the PR label?
i thought we wanted to only have it run nightly? or is it nightly + label but no merge queue for now?
Signed-off-by: Gagan Kaushik <[email protected]>
…ests pass OR on PRs with the ciflow:multi-gpu label Signed-off-by: Gagan Kaushik <[email protected]>
    runs-on: ubuntu-latest
    outputs:
      labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
    steps:
      - name: Get PR number from branch
        if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: get-pr-num
        run: |
          PR_NUM=$(echo ${{ github.ref_name }} | grep -oE '[0-9]+$')
          echo "pr_num=$PR_NUM" >> $GITHUB_OUTPUT
      - name: Get PR labels
        id: get-labels
        if: startsWith(github.ref, 'refs/heads/pull-request/')
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          LABELS=$(gh api repos/${{ github.repository }}/pulls/${{ steps.get-pr-num.outputs.pr_num }} --jq '[.labels[].name]' || echo "[]")
          echo "labels=$LABELS" >> $GITHUB_OUTPUT
          echo "Retrieved labels: $LABELS"
      - name: Set empty labels for non-PR branches
        if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
        id: get-labels-empty
        run: |
          echo "labels=[]" >> $GITHUB_OUTPUT
          echo "Set empty labels for non-PR branch"
  unit-tests-multi-gpu:
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Copilot Autofix
The recommended fix is to explicitly specify least privilege permissions for the workflow or, ideally, for the individual jobs in .github/workflows/unit-tests-recipes.yml.
- For the get-pr-labels job, the only required permission is to read pull request metadata (and potentially repository contents if any fetch occurs).
- pull-requests: read and optionally contents: read cover reading PR info and repo access.
- Add a top-level permissions block after name: (applies to all jobs), or alternatively, apply a narrower permissions block to each job. For simplicity and following the error's suggestion, add at the top/workflow level.
- Edit the file .github/workflows/unit-tests-recipes.yml, adding:

      permissions:
        contents: read
        pull-requests: read

  directly after the name: key and before events (on:).
@@ -22,6 +22,9 @@
 # Note: On push, multi-GPU tests run in parallel with single-GPU tests (no dependency)

 name: "BioNeMo Recipes CI"
+permissions:
+  contents: read
+  pull-requests: read

 on:
   push:
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 2fe65dc
| needs: | ||
| - build-bionemo-image | ||
| - run-tests-single-gpu | ||
| - get-pr-labels | ||
| runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo | ||
| container: | ||
| image: svcbionemo023/bionemo-framework:${{ github.run_id }} | ||
| credentials: | ||
| username: ${{ vars.DOCKER_USERNAME }} | ||
| password: ${{ secrets.DOCKER_PASSWORD }} | ||
| # Run multi-GPU tests ONLY when: | ||
| # Prerequisites: build succeeds AND single-GPU tests pass | ||
| # Then run if: schedule OR (push with ciflow:all OR ciflow:multi-gpu label) | ||
| # Do NOT run on merge_group or any other events | ||
| if: | | ||
| (needs.build-bionemo-image.result == 'success') && | ||
| (needs.run-tests-single-gpu.result == 'success') && | ||
| ( | ||
| github.event_name == 'schedule' || | ||
| ( | ||
| github.event_name == 'push' && | ||
| ( | ||
| contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') || | ||
| contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu') | ||
| ) | ||
| ) | ||
| ) | ||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Run multi-GPU tests | ||
| env: | ||
| BIONEMO_DATA_SOURCE: ngc | ||
| run: | | ||
| chmod +x ./ci/scripts/run_pytest_multigpu.sh | ||
| ./ci/scripts/run_pytest_multigpu.sh | ||
| run-tests-slow-single-gpu: |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Copilot Autofix
To fix the problem, we need to add the permissions key at the workflow level (top-level, after the workflow name and before the first trigger under on:). This will instruct GitHub to issue a token with only the specified minimal permissions for all jobs, unless a particular job specifies a more permissive permissions: block.
The best single edit is to add the following block:
permissions:
  contents: read

after the workflow name and before the on: block, i.e., after line 46 and before line 48. This limits all jobs in this workflow (unless individually overridden) to only read repository contents using the automatic GITHUB_TOKEN. No imports or further changes are required, as only the workflow file is affected.
@@ -45,6 +45,9 @@

 name: "BioNeMo Framework CI"

+permissions:
+  contents: read
+
 on:
   push:
     branches:
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 3bf090d
…plify.py Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 5f17689
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test f14497c
Signed-off-by: Gagan Kaushik <[email protected]>
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 91e4dcd
Description
Usage
Simply add @pytest.mark.multi_gpu to any future multi-gpu tests. Use the new ciflow:multi-gpu label to run them in PRs. Otherwise, they will run in the merge queue and on the nightly schedule.
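For example, a new multi-GPU test might look like this (a minimal sketch; the skip guard and the test body are illustrative and not taken from the repository):

import pytest
import torch

# Hypothetical guard so the test is skipped on machines with fewer than two GPUs.
requires_two_gpus = pytest.mark.skipif(
    torch.cuda.device_count() < 2, reason="requires at least 2 GPUs"
)


@pytest.mark.multi_gpu
@requires_two_gpus
def test_two_gpus_visible():
    # Placeholder body; a real test would typically launch torchrun subprocesses
    # that exercise both devices.
    assert torch.cuda.device_count() >= 2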
Type of changes
Pre-submit Checklist
Summary by CodeRabbit
New Features
Documentation
CI
Tests
Bug Fixes