CI tests on multi-GPU runners #1229
base: main
Conversation
Signed-off-by: Gagan Kaushik <[email protected]>
Walkthrough
Adds a …
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Dev as Developer
participant GH as GitHub Actions
participant Jobs as unit-tests-* jobs
participant Verify as verify-tests-status
Dev->>GH: Push PR or schedule
GH->>Jobs: Start workflow (matrix)
alt PR CI
Jobs->>Jobs: Run single-GPU jobs (multi-GPU jobs skipped)
else Nightly / scheduled
Jobs->>Jobs: Run single-GPU jobs
Jobs->>Jobs: Run multi-GPU jobs after single-GPU success
end
Jobs-->>GH: Upload artifacts/coverage
Jobs->>Verify: Report job results
Verify-->>GH: Aggregate status
sequenceDiagram
autonumber
participant Job as CI job
participant Runner as pytest_runner.sh
participant PyTest as pytest
Job->>Runner: Invoke with flags (--only-multi-gpu / --skip-multi-gpu / --skip-slow / --only-slow)
Runner->>Runner: Build MARKER_EXPR from flags (multi_gpu, slow)
alt MARKER_EXPR non-empty
Runner->>PyTest: pytest -m "<MARKER_EXPR>" [options]
else
Runner->>PyTest: pytest [options]
end
PyTest-->>Job: Return results and reports
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
🧰 Additional context used🪛 actionlint (1.7.8).github/workflows/unit-tests-framework.yml257-257: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file (runner-label) 311-311: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file (runner-label) 🪛 GitHub Check: CodeQL.github/workflows/unit-tests-framework.yml[warning] 254-278: Workflow does not contain permissions ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
/ok to test 68fa036
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 62f381d
❌ 8 Tests Failed:
To view more test analytics, go to the Test Analytics Dashboard
From standup: Is it possible to start having these run nightly, and track queue time + success rate?
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 250bd29
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 68307b1
...onemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (Fixed)
...bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (Fixed)
...s/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (Fixed)
.../bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (Fixed)
.../bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (Fixed)
...es/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (Fixed)
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 6fbe6f4
- Merged in latest changes from origin/main
- Added @pytest.mark.multi_gpu to test_distributed_fp8.py::test_multi_process_fp8_recipes_are_synced
- Added @pytest.mark.multi_gpu to test_train.py::test_distributed_training_gradient_equivalence
- These new multi-GPU tests will run on the 2-GPU runner in merge queue/schedule
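For readers unfamiliar with the marker mechanism, a minimal sketch of what such a marked test can look like; the test name and body here are illustrative placeholders, not the actual tests named above:

```python
import pytest
import torch


# The custom marker is what the CI jobs and pytest_runner.sh flags select on
# (e.g. `pytest -m multi_gpu` or `--only-multi-gpu`), while the skipif guard
# keeps the test from running at all on machines with fewer than two GPUs.
@pytest.mark.multi_gpu
@pytest.mark.skipif(torch.cuda.device_count() < 2, reason="requires at least 2 GPUs")
def test_two_gpu_example():  # hypothetical test, for illustration only
    tensor = torch.ones(1, device="cuda:0")
    assert tensor.item() == 1.0
```

The marker itself has to be registered under `[tool.pytest.ini_options] markers` (as the pyproject.toml changes in this PR do) so pytest does not emit unknown-marker warnings.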
…g in evo2 preprocessing that ignored random seed during bootstrap
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test b777145
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test c3f9737
Actionable comments posted: 1
♻️ Duplicate comments (3)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
88-92: The security bot's concern is incorrect. The socket is bound to "localhost" (line 89), not to all interfaces. This is secure and appropriate for local testing.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
82-86: The security bot's concern is incorrect. The socket is bound to "localhost" (line 83), not to all interfaces. This is secure and appropriate for local testing.
.github/workflows/unit-tests-framework.yml (1)
254-294: Address the missing permissions block. Static analysis tools flagged that the run-tests-multi-gpu job does not limit the permissions of the GITHUB_TOKEN. Consider adding an explicit permissions block as a security best practice. Apply this diff to set minimal permissions:

   run-tests-multi-gpu:
+    permissions:
+      contents: read
     needs:
       - build-bionemo-image
       - get-pr-labels

This same recommendation applies to the run-tests-slow-multi-gpu job at lines 325-357.
🧹 Nitpick comments (4)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
83-99: Pre-spawn environment setup looks good. The environment variable setup before spawning ensures all processes use consistent values. The port allocation via temporary socket binding is a standard pattern in distributed testing.
Note: There's a small race condition window between closing the socket (line 91) and the test using the port, but this is acceptable in practice given the very short time window.
The A6000 workaround is well-documented and addresses a known hanging issue.
Consider extracting this environment setup pattern into a shared helper function, as it's duplicated across multiple test files (e.g., test_discrete_flow_matching_parallel.py).
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
77-93: Environment setup is identical to test_d3pm_parallel.py. This code is duplicated from the other test file. Both files would benefit from extracting this setup logic into a shared helper function (e.g., setup_distributed_test_environment() in a test utilities module). Apply a refactor to extract the common logic.
Example helper in a shared test utilities module:

def setup_distributed_test_environment():
    """Set up environment variables for distributed testing."""
    if "MASTER_ADDR" not in os.environ:
        os.environ["MASTER_ADDR"] = "localhost"
    if "MASTER_PORT" not in os.environ:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("localhost", 0))
        port = s.getsockname()[1]
        s.close()
        os.environ["MASTER_PORT"] = str(port)

    # Fix hanging issue on A6000 GPUs
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            if "A6000" in torch.cuda.get_device_name(i):
                os.environ["NCCL_P2P_DISABLE"] = "1"
                break

Then use it in both test files:

-    # Set up environment variables BEFORE spawning so all processes use the same values
-    if "MASTER_ADDR" not in os.environ:
-        os.environ["MASTER_ADDR"] = "localhost"
-    if "MASTER_PORT" not in os.environ:
-        # Find a free port for this test (bind to localhost only for security)
-        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
-        s.bind(("localhost", 0))
-        port = s.getsockname()[1]
-        s.close()
-        os.environ["MASTER_PORT"] = str(port)
-
-    # Fix hanging issue on A6000 GPUs
-    if torch.cuda.is_available():
-        for i in range(torch.cuda.device_count()):
-            if "A6000" in torch.cuda.get_device_name(i):
-                os.environ["NCCL_P2P_DISABLE"] = "1"
-                break
+    setup_distributed_test_environment()

sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (1)
79-88: Environment setup correctly uses localhost and addresses security concerns. The binding to "localhost" prevents external network access, addressing the security bot's concern about binding to all interfaces.
The dynamic port allocation has a theoretical TOCTOU race (the port could be taken between s.close() and spawn()), but this is acceptable in practice for CI tests where port collisions are rare.
Optional improvement: If flakiness occurs, consider keeping the socket open and using SO_REUSEADDR, or implementing retry logic with a port range.
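One possible shape for that optional improvement, sketched with an invented helper name rather than code from the PR:

```python
import socket


def find_free_localhost_port(max_tries: int = 5) -> int:
    """Pick an ephemeral port on localhost, retrying on the rare bind failure."""
    last_error = None
    for _ in range(max_tries):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                # SO_REUSEADDR narrows (but does not eliminate) the window in which
                # another process can grab the port between this probe and the real bind.
                sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                sock.bind(("localhost", 0))
                return sock.getsockname()[1]
        except OSError as err:
            last_error = err
    raise RuntimeError(f"could not find a free port: {last_error}")
```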
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (1)
497-598: Wrap download subprocess calls in try/except to skip on failure.
Surround each subprocess.run for wget, zcat, and cat with exception handling (CalledProcessError, TimeoutExpired) and call pytest.skip(...) on errors to prevent CI hangs and flakiness.
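A rough sketch of that pattern, assuming a helper local to the test module; the helper name, paths, and timeout value are placeholders:

```python
import subprocess

import pytest


def fetch_chromosome(test_dir, name="chr20"):
    """Download and decompress one reference file, skipping the test if that fails."""
    url = f"https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/{name}.fa.gz"
    try:
        subprocess.run(["wget", url], cwd=test_dir, check=True, timeout=600)
        with open(test_dir / f"{name}.fa", "w") as out:
            subprocess.run(["zcat", f"{name}.fa.gz"], stdout=out, cwd=test_dir, check=True, timeout=600)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
        # Treat missing network or tooling as an environment problem, not a test failure.
        pytest.skip(f"could not prepare {name} test data: {err}")
```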
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
.github/labels.yml (1 hunks)
.github/workflows/unit-tests-framework.yml (6 hunks)
.github/workflows/unit-tests-recipes.yml (2 hunks)
bionemo-recipes/models/amplify/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (4 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (4 hunks)
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (8 hunks)
ci/scripts/pytest_runner.sh (4 hunks)
docs/docs/main/contributing/contributing.md (1 hunks)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (3 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (4 hunks)
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
Evo2Preprocessor (47-451), preprocess_offline (387-451)
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/config.py (1)
Evo2PreprocessingConfig (46-101)
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
329-329: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
.github/workflows/unit-tests-recipes.yml
132-132: label "linux-amd64-gpu-l4-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
175-175: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
200-200: shellcheck reported issue in this script: SC1007:warning:2:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
200-200: shellcheck reported issue in this script: SC1007:warning:5:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-296: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
.github/workflows/unit-tests-recipes.yml
[warning] 174-216: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: run-tests-slow-multi-gpu
- GitHub Check: run-tests-single-gpu
- GitHub Check: run-tests-notebooks
- GitHub Check: run-tests-slow-single-gpu
- GitHub Check: run-tests-multi-gpu
- GitHub Check: unit-tests-single-gpu (models/amplify)
- GitHub Check: unit-tests-single-gpu (recipes/esm2_native_te)
🔇 Additional comments (32)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
63-69: LGTM! Clear test parameterization. The expanded parameterization with explicit IDs and the multi_gpu mark properly distinguishes single- and multi-GPU test cases.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
57-63: LGTM! Consistent test parameterization. The parameterization matches the pattern in test_d3pm_parallel.py, providing clear test case identification.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (4)
16-17: LGTM: Imports for environment setup. These standard library imports support the new multi-GPU test infrastructure added below.
59-65: LGTM: Multi-GPU test marking. The parametrization correctly adds explicit IDs and marks the 2-device case with pytest.mark.multi_gpu, enabling selective execution via CI labels as described in the PR objectives.
79-88: LGTM: Secure localhost binding for distributed test coordination. The code correctly binds to "localhost" (not "" or "0.0.0.0"), so the past security advisory about binding to all network interfaces does not apply here. The temporary socket pattern for finding a free port is standard practice for distributed testing, though there is a minor TOCTOU race between closing the socket and spawning processes, which is acceptable in test isolation.
90-96: LGTM: A6000 hang mitigation. The workaround correctly addresses the known NCCL P2P hang issue on A6000 GPUs mentioned in the PR description. The environment variable affects the entire process, which is acceptable for test infrastructure where tests typically run in isolation.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (4)
16-17: LGTM! Standard library imports appropriately added for environment setup and port discovery needed by the distributed test infrastructure.
59-65: LGTM! Excellent refactor to explicit parameterization with clear IDs and the multi_gpu marker. This aligns perfectly with the PR's objective to enable selective multi-GPU test execution in CI.
79-88: Pre-spawn environment setup looks good. Setting MASTER_ADDR and MASTER_PORT before spawning ensures all processes share the same distributed configuration. The port discovery correctly binds to localhost only (not all interfaces), which addresses security concerns.
Note: There's a theoretical TOCTOU race where another process could claim the port between s.close() and actual use, but this is acceptable in the test environment where pytest provides isolation and the risk is negligible.
90-95: Good hardware-specific workaround for the A6000 hanging issue. This correctly addresses the MOCO multi-GPU test hangs mentioned in the PR objectives by disabling NCCL P2P communication on A6000 devices. While this is a workaround rather than a root cause fix, it's appropriate for ensuring CI stability.
The simple string match on device name and setting the environment variable globally when any A6000 is detected is a reasonable approach.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (3)
17-18: LGTM: Necessary imports for environment setup. The os and socket imports support the dynamic port allocation and environment configuration added below.
59-65: LGTM: Clear parametrization with appropriate markers. The explicit test IDs and pytest.mark.multi_gpu marker align with the PR's multi-GPU CI strategy.
90-95: LGTM: Conservative workaround for A6000 hanging issues. Disabling NCCL peer-to-peer globally when any A6000 is detected is a pragmatic solution to the hanging problem described in the PR. This conservative approach prioritizes test reliability over potential performance gains from P2P communication.
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1)
78-78: LGTM! The multi_gpu marker correctly identifies this test for multi-GPU execution, aligning with the test's existing GPU count requirement and functionality.
bionemo-recipes/models/amplify/pyproject.toml (1)
34-38: LGTM! Pytest marker definitions are properly structured and align with the PR's multi-GPU test framework.
bionemo-recipes/models/esm2/pyproject.toml (1)
33-36: LGTM! Marker definitions correctly support the new multi-GPU test categorization.
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (1)
137-137: LGTM! All multi_gpu markers correctly identify tests that spawn multi-process distributed training with 2 GPUs, enabling proper test routing in CI.
Also applies to: 312-312, 565-565, 683-683, 803-803, 965-965, 1083-1083, 1203-1203
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
348-348: Good fix for deterministic split assignment. Changing from a set to a list enables deterministic ordering via random.sample() later in the flow, which respects the configured seed.
367-377: Excellent fix for the seed reproducibility bug. The combination of random.sample() with the seeded context and pop(0) ensures that split assignments are deterministic and reproducible with the same seed, addressing the bug mentioned in the PR description. The previous set.pop() approach was non-deterministic because Python sets have arbitrary iteration order.
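To make the determinism argument concrete, a small self-contained sketch (not the actual preprocess.py code):

```python
import random

SPLITS = ["train", "val", "test"]


def ordered_splits(seed: int) -> list[str]:
    """Same seed in, same ordering out, so split assignment is reproducible."""
    return random.Random(seed).sample(SPLITS, len(SPLITS))


assert ordered_splits(42) == ordered_splits(42)
# By contrast, set.pop() draws from a hash-dependent order, so
# {"train", "val", "test"}.pop() can differ between interpreter runs.
```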
.github/labels.yml (1)
22-24: LGTM! The new ciflow:multi-gpu label properly integrates with the CI framework for multi-GPU test execution.
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1)
69-69: LGTM! The multi_gpu marker appropriately categorizes this 2-GPU test for CI routing.
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1)
64-64: LGTM! The multi_gpu marker correctly identifies this multi-process FP8 test for proper CI execution.
docs/docs/main/contributing/contributing.md (1)
132-137: LGTM! The documentation for the ciflow:multi-gpu label is clear and consistent with other label descriptions. It properly explains the label's purpose, usage, and default behavior.
bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (1)
172-172: LGTM! The @pytest.mark.multi_gpu markers correctly identify tests that require multiple GPUs. All marked tests use torchrun with --nproc_per_node=2 and already have the @requires_multi_gpu decorator, making them suitable for multi-GPU test selection.
Also applies to: 420-420, 630-630, 912-912
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (1)
83-90: LGTM with observation! The @pytest.mark.multi_gpu markers correctly identify test parameters that require multiple GPUs (ddp>1, cp>1, pp>1, tp>1).
Note: The wi parameter was changed from "batch" to "epoch" for one of the ddp=2 cases (line 84 compared to lines 242-244). This appears intentional but is unrelated to the multi-GPU marker addition.
Also applies to: 241-267
ci/scripts/pytest_runner.sh (1)
105-119: Verify handling of conflicting flag combinations. The marker expression logic correctly combines slow and multi_gpu filters. However, if users provide conflicting flags (e.g., both --skip-slow and --only-slow, or both --skip-multi-gpu and --only-multi-gpu), the resulting marker expression would be contradictory (e.g., "not slow and slow").
Consider adding validation to detect and reject conflicting flag combinations:

+# Validate flag combinations
+if [[ "$SKIP_SLOW" == true && "$ONLY_SLOW" == true ]]; then
+  echo "Error: Cannot use both --skip-slow and --only-slow" >&2
+  exit 1
+fi
+if [[ "$SKIP_MULTI_GPU" == true && "$ONLY_MULTI_GPU" == true ]]; then
+  echo "Error: Cannot use both --skip-multi-gpu and --only-multi-gpu" >&2
+  exit 1
+fi
+
 # Build marker expression for filtering tests
 MARKER_EXPR=""

bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (1)
49-49: LGTM! The @pytest.mark.multi_gpu markers correctly identify tests that require multiple GPUs. All marked tests use torchrun with --nproc_per_node=2 and have the @requires_multi_gpu decorator, making them suitable for multi-GPU test selection.
Also applies to: 68-68, 87-87, 106-106
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
621-625: LGTM! The updated dataset_config fixture correctly falls back to downloading and preprocessing data when not provided via command line or environment variables. This enables the gradient equivalence test to run in CI environments.
689-689: LGTM! The @pytest.mark.multi_gpu marker correctly identifies this test as requiring multiple GPUs. The test uses 2 GPUs for various distributed parallelism strategies (dp/cp/tp/pp).
.github/workflows/unit-tests-recipes.yml (3)
130-172: LGTM! The unit-tests-single-gpu job correctly filters to single-GPU tests using pytest -v -m "not multi_gpu" . The job configuration is appropriate for single-GPU test execution.
173-215: LGTM! The unit-tests-multi-gpu job correctly filters to multi-GPU tests using pytest -v -m "multi_gpu" . The || exit 0 allows recipes without multi-GPU tests to pass gracefully.
219-233: LGTM! The verify-recipe-tests job correctly aggregates results from both unit-tests-single-gpu and unit-tests-multi-gpu jobs using the contains(needs.*.result, ...) pattern.
    # Set up environment variables BEFORE spawning so all processes use the same values
    if "MASTER_ADDR" not in os.environ:
        os.environ["MASTER_ADDR"] = "localhost"
    if "MASTER_PORT" not in os.environ:
        # Find a free port for this test (bind to localhost only for security)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("localhost", 0))
        port = s.getsockname()[1]
        s.close()
        os.environ["MASTER_PORT"] = str(port)

    # Fix hanging issue on A6000 GPUs
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            if "A6000" in torch.cuda.get_device_name(i):
                os.environ["NCCL_P2P_DISABLE"] = "1"
                break
Scope the MASTER/NCCL env tweaks to this test only
Setting MASTER_ADDR, MASTER_PORT, and NCCL_P2P_DISABLE via os.environ[...] = ... leaks those values to every later test in the same Python process. That cross-test contamination is already biting us: once MASTER_PORT is filled here, later spawns reuse the same port instead of grabbing a fresh free one, and forcing NCCL_P2P_DISABLE=1 means every subsequent distributed test runs without GPU P2P even when it should be enabled. Please switch to pytest’s monkeypatch (or another scoped helper) so the overrides are applied for this test only, e.g.:
- if "MASTER_ADDR" not in os.environ:
- os.environ["MASTER_ADDR"] = "localhost"
+ monkeypatch.setenv("MASTER_ADDR", os.environ.get("MASTER_ADDR", "localhost"))
...
- os.environ["MASTER_PORT"] = str(port)
+ monkeypatch.setenv("MASTER_PORT", str(port))
...
- os.environ["NCCL_P2P_DISABLE"] = "1"
+ monkeypatch.setenv("NCCL_P2P_DISABLE", "1")and add monkeypatch to the test signature. That keeps the CI-friendly defaults without mutating global state that other tests rely on.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py
around lines 79-95, the test currently mutates os.environ directly which leaks
MASTER_ADDR, MASTER_PORT and NCCL_P2P_DISABLE to other tests; change the test to
accept the pytest monkeypatch fixture and replace direct os.environ writes with
monkeypatch.setenv calls (compute a free localhost port as currently done, then
monkeypatch.setenv("MASTER_PORT", str(port))), use
monkeypatch.setenv("MASTER_ADDR","localhost") if not present, and for the A6000
NCCL tweak only call monkeypatch.setenv("NCCL_P2P_DISABLE","1") when a local
CUDA device name contains "A6000" so the environment change is scoped to this
test and does not affect subsequent tests.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (2)
112-114: Fix skip message text. The condition is world_size > device_count, but the message says "less than". Correct the message.

-        pytest.skip(f"World size {world_size} is less than the number of GPUs {torch.cuda.device_count()}")
+        pytest.skip(f"World size {world_size} exceeds available GPUs ({torch.cuda.device_count()})")

154-154: Wrong command name in assertion message. This test runs predict_evo2, not train_evo2.

-    assert result.returncode == 0, "train_evo2 command failed."
+    assert result.returncode == 0, "predict_evo2 command failed."

sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (1)
394-397: Bug: partial file‑existence check due to zip truncation. Only (BIN, train) and (IDX, val) are checked; the test split and other combinations are missed. This can cause accidental overwrites or skips.

-        if any(
-            self._get_output_filename(preproc_config, ext, split).is_file()
-            for ext, split in zip([self.BIN, self.IDX], [self.TRAIN, self.VAL, self.TEST])
-        ):
+        if any(
+            self._get_output_filename(preproc_config, ext, split).is_file()
+            for ext in (self.BIN, self.IDX)
+            for split in (self.TRAIN, self.VAL, self.TEST)
+        ):

.github/workflows/unit-tests-recipes.yml (1)
1-20: Add least‑privilege token permissions. Declare minimal permissions for GITHUB_TOKEN.

 name: "BioNeMo Recipes CI"
@@
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true
+
+permissions:
+  contents: read
♻️ Duplicate comments (11)
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (7)
312-315: LGTM; same env-port suggestion applies. Marker addition is correct. Please apply the per-test MASTER_ADDR/PORT env setup as suggested above to avoid flakiness.
565-568: LGTM; same env-port suggestion applies. Good to gate as multi_gpu. Mirror the MASTER_ADDR/PORT and optional A6000 workaround in this test's env before subprocess.run.
683-686: LGTM; same env-port suggestion applies. Use per-test MASTER_ADDR/PORT in the env passed to subprocess to avoid collisions.
803-806: LGTM; same env-port suggestion applies. Consistent with others; add unique MASTER_ADDR/PORT in env.
965-968: LGTM; same env-port suggestion applies. Please add the per-test MASTER_ADDR/PORT env setup in this DDP multi-proc test too.
1083-1086: LGTM; same env-port suggestion applies. Marking looks good; add unique MASTER_ADDR/PORT in env.
1203-1206: LGTM; same env-port suggestion applies. Mirror env setup (IPv4 loopback + free port) as noted in earlier comment.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (1)
79-96: Same env hardening as other parallel tests. Apply IPv4 loopback, always-pick-port, and scope NCCL tweak to multi-GPU as suggested in the MDLM test.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (1)
79-96: Repeat env robustness tweaks (IPv4, fresh port, scope NCCL). Mirror the small refactor used in sibling tests to reduce flakiness and optimize the single-GPU path.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (1)
77-94: Apply same env hardening (IPv4, always-pick-port, scope NCCL) and consider deduping. Consistent tweaks reduce cross-test collisions and unnecessary NCCL changes on single-GPU.
.github/workflows/unit-tests-framework.yml (1)
37-57: Add least‑privilege token permissions. Set minimal GITHUB_TOKEN permissions at workflow level.

 name: "BioNeMo Framework CI"
 on:
@@
 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true
+
+permissions:
+  contents: read
🧹 Nitpick comments (11)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (1)
94-99: A6000 stability workaround is appropriate. Setting NCCL_P2P_DISABLE=1 for A6000 GPUs addresses the hanging issues mentioned in the PR description. The implementation correctly checks all visible devices and applies the workaround when needed.
Consider centralizing this A6000 workaround if it's used across multiple test files. A shared test utility or conftest.py fixture could reduce duplication and make the workaround easier to maintain or remove in future NCCL versions.
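If that centralization were done, one possible shape is an autouse fixture in a shared conftest.py; this is only a sketch, and the fixture name is invented:

```python
import pytest
import torch


@pytest.fixture(autouse=True)
def disable_nccl_p2p_on_a6000(monkeypatch):
    """Apply the A6000 NCCL P2P workaround per test, without leaking into other tests."""
    if torch.cuda.is_available() and any(
        "A6000" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
    ):
        monkeypatch.setenv("NCCL_P2P_DISABLE", "1")
```

Using monkeypatch here would also address the earlier review note about environment variables leaking between tests.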
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (1)
79-96: Harden env setup: prefer IPv4, always set a fresh port, scope NCCL tweak to multi-GPU, and dedupe.
- Use 127.0.0.1 to avoid IPv6/NCCL issues.
- Always assign a free port to reduce cross-test collisions.
- Only set NCCL_P2P_DISABLE when world_size >= 2.
- Consider factoring this into a small helper to reuse across tests.
Diff:
- if "MASTER_ADDR" not in os.environ: - os.environ["MASTER_ADDR"] = "localhost" - if "MASTER_PORT" not in os.environ: - # Find a free port for this test (bind to localhost only for security) - s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) - s.bind(("localhost", 0)) - port = s.getsockname()[1] - s.close() - os.environ["MASTER_PORT"] = str(port) + # Always prefer IPv4 loopback and pick a free ephemeral port + os.environ["MASTER_ADDR"] = "127.0.0.1" + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + s.bind(("127.0.0.1", 0)) + os.environ["MASTER_PORT"] = str(s.getsockname()[1]) - # Fix hanging issue on A6000 GPUs - if torch.cuda.is_available(): + # Fix hanging issue on A6000 GPUs (only matters for multi-GPU) + if torch.cuda.is_available() and world_size >= 2: for i in range(torch.cuda.device_count()): if "A6000" in torch.cuda.get_device_name(i): os.environ["NCCL_P2P_DISABLE"] = "1" breaksub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (1)
137-146: Subprocess can hang; add timeout and raise on failure. Guard long‑running distributed calls to avoid indefinite hangs.

-    result = subprocess.run(
+    try:
+        result = subprocess.run(
             command,
             shell=True,  # Use the shell to interpret wildcards (e.g. SDH*)
             cwd=tmp_path,  # Run in the temporary directory
             capture_output=True,  # Capture stdout and stderr for debugging
             env=env,  # Pass in the env where we override the master port.
-        text=True,  # Decode output as text
-    )
+            text=True,  # Decode output as text
+            timeout=480,
+        )
+    except subprocess.TimeoutExpired as e:
+        sys.stderr.write(f"Timed out running predict_evo2: {e}\n")
+        raise

sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
348-349: Deterministic split assignment: good fix. Replacing set.pop() with a seeded random ordering removes nondeterminism. Minor: pop(0) is O(n); for 3 items it's trivial, but a deque would be slightly cleaner.

-    splits_needed = random.sample(splits_list, len(splits_list)) if splits_list else []
+    from collections import deque
+    splits_needed = deque(random.sample(splits_list, len(splits_list))) if splits_list else deque()
...
-    split = splits_needed.pop(0)
+    split = splits_needed.popleft()

Also applies to: 366-370, 375-378

249-252: Nondeterministic seed component from Python hash(). hash(filepath) is salted per process (PYTHONHASHSEED), breaking cross‑run determinism. Prefer a stable hash of the path string.

-    with self.preprocessing_context_manager(
-        config.seed + hash(filepath) + seq_idx if config.seed is not None else None
-    ):
+    stable = int.from_bytes(__import__("hashlib").md5(str(filepath).encode()).digest()[:8], "big")
+    with self.preprocessing_context_manager(
+        (config.seed + stable + seq_idx) if config.seed is not None else None
+    ):

ci/scripts/pytest_runner.sh (1)
105-120: Marker filtering logic LGTM; add conflict guard. If both --skip-multi-gpu and --only-multi-gpu are set, fail fast to avoid surprising selection.

 MARKER_EXPR=""
+# Guard conflicting flags early
+if [[ "$SKIP_MULTI_GPU" == true && "$ONLY_MULTI_GPU" == true ]]; then
+  echo "Use either --skip-multi-gpu or --only-multi-gpu, not both." >&2
+  exit 2
+fi

.github/workflows/unit-tests-framework.yml (2)
254-283: Multi‑GPU job config LGTM; add a timeout. Add timeout-minutes to prevent indefinite hangs on distributed runs.

   run-tests-multi-gpu:
@@
-    if: |
+    timeout-minutes: 120
+    if: |

350-358: Slow multi‑GPU job: add timeout. Same recommendation for the slow suite.

   run-tests-slow-multi-gpu:
@@
-    if: |
+    timeout-minutes: 180
+    if: |

bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (1)
172-174: Multi‑GPU gating LGTM. multi_gpu markers align with runner gating and existing skipifs.
Add timeout=... to subprocess.run calls in these multi‑GPU tests to prevent indefinite hangs on CI infra.
Also applies to: 420-422, 630-632, 912-914
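As a sketch of that suggestion (the script name and the limit are placeholders, not values from these tests):

```python
import subprocess

# Bounding the torchrun call means a hung NCCL collective fails the test
# with TimeoutExpired instead of stalling the CI job indefinitely.
result = subprocess.run(
    ["torchrun", "--nproc_per_node=2", "train_example.py"],  # placeholder script
    capture_output=True,
    text=True,
    timeout=1800,  # seconds; choose a value well above the normal runtime
)
assert result.returncode == 0, result.stderr
```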
.github/workflows/unit-tests-recipes.yml (1)
156-167: Shellcheck: clarify empty PIP_CONSTRAINT. Use an explicit empty string to silence SC1007.

-          PIP_CONSTRAINT= pip install -e .
+          PIP_CONSTRAINT="" pip install -e .
@@
-          PIP_CONSTRAINT= pip install -r requirements.txt
+          PIP_CONSTRAINT="" pip install -r requirements.txt

Also applies to: 198-211
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (1)
519-533: External downloads: add timeouts and basic retry. Network fetches can hang. Add a timeout and a simple retry loop.

-    subprocess.run(
-        ["wget", "https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/chr20.fa.gz"],
-        cwd=test_dir,
-        check=True,
-    )
+    for f in chr20 chr21 chr22; do
+      for i in {1..3}; do
+        if wget -T 120 -O "$f.fa.gz" "https://hgdownload.soe.ucsc.edu/goldenpath/hg38/chromosomes/$f.fa.gz"; then break; fi
+        echo "Retry $i for $f..."
+        sleep 2
+      done
+    done
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
.github/labels.yml (1 hunks)
.github/pull_request_template.md (1 hunks)
.github/workflows/unit-tests-framework.yml (6 hunks)
.github/workflows/unit-tests-recipes.yml (2 hunks)
bionemo-recipes/models/amplify/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/pyproject.toml (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1 hunks)
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_distributed_checkpointing.py (4 hunks)
bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (4 hunks)
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py (8 hunks)
ci/scripts/pytest_runner.sh (4 hunks)
docs/docs/main/contributing/contributing.md (2 hunks)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (3 hunks)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (4 hunks)
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (3 hunks)
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
sub-packages/bionemo-evo2/src/bionemo/evo2/data/preprocess.py (2)
Evo2Preprocessor (47-451), preprocess_offline (387-451)
sub-packages/bionemo-evo2/src/bionemo/evo2/utils/config.py (1)
Evo2PreprocessingConfig (46-101)
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
329-329: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
.github/workflows/unit-tests-recipes.yml
132-132: label "linux-amd64-gpu-l4-latest-1" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
175-175: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
200-200: shellcheck reported issue in this script: SC1007:warning:2:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
200-200: shellcheck reported issue in this script: SC1007:warning:5:18: Remove space after = if trying to assign a value (for empty string, use var='' ... )
(shellcheck)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-296: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
.github/workflows/unit-tests-recipes.yml
[warning] 174-216: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: Analyze (rust)
🔇 Additional comments (33)
sub-packages/bionemo-llm/tests/bionemo/llm/test_lightning.py (1)
78-78: LGTM! Marker correctly categorizes multi-GPU test. The @pytest.mark.multi_gpu marker appropriately tags this test for selective execution in CI. It complements the existing skipif decorator: the skipif prevents execution on insufficient hardware, while the marker enables test selection via CLI flags (--skip-multi-gpu, --only-multi-gpu) and the ciflow:multi-gpu label.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/discrete/test_d3pm_parallel.py (2)
63-69: LGTM! Multi-GPU test marking is correctly applied. The parametrization with named IDs and pytest.mark.multi_gpu for world_size=2 aligns with the PR objectives to mark multi-GPU tests for conditional CI execution.
83-92: LGTM! Environment setup follows best practices.The pre-spawn environment configuration ensures all processes share the same distributed training settings. Binding to
"localhost"(not all interfaces) is secure, addressing the security concern from the previous review.Note: There's a minor TOCTOU race where another process could claim the port between closing the socket and spawning, but this is negligible in practice and is a known limitation of this common pattern.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/discrete_time/continuous/test_ddpm_parallel.py (4)
16-17: LGTM!Standard library imports required for the new environment setup and port-finding functionality.
59-65: LGTM!Good parametrization pattern that enables testing both single-GPU and multi-GPU scenarios, with the multi-GPU case properly marked for selective execution in CI.
79-88: Good security practice: binding to localhost.The environment setup correctly uses
"localhost"instead of binding to all interfaces (''or'0.0.0.0'), which addresses the security concern from the previous review. The MASTER_ADDR/MASTER_PORT configuration before spawning ensures all processes share consistent values.
90-95: Necessary workaround for A6000 hanging issue.This fix addresses the hanging multi-GPU tests mentioned in the PR description. Setting
NCCL_P2P_DISABLE=1for A6000 GPUs is a standard workaround for known peer-to-peer communication issues on these devices.sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_mdlm_parallel.py (2)
17-18: LGTM: required imports added.Imports for os/socket are appropriate for pre-spawn env setup.
59-65: Nice parametrize with multi_gpu gating.Clear ids and selective marking improve CI selection.
sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_vdm_parallel.py (2)
16-17: LGTM: imports for env prep.
59-65: Parametrize looks good with multi_gpu mark.sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/continuous/test_continuous_flow_matching_parallel.py (2)
16-17: LGTM: imports for env prep.
59-65: Parametrize with ids and multi_gpu mark is clear.sub-packages/bionemo-moco/tests/bionemo/moco/interpolants/continuous_time/discrete/test_discrete_flow_matching_parallel.py (2)
16-17: LGTM: imports for env prep.
57-63: Parametrize: good use of ids and multi_gpu mark.bionemo-recipes/models/amplify/pyproject.toml (1)
34-38: LGTM!The pytest markers are well-defined with clear descriptions. The expansion from a single marker to a list is properly formatted and aligns with the PR's multi-GPU testing infrastructure.
bionemo-recipes/models/esm2/pyproject.toml (1)
33-36: LGTM!The pytest markers are properly defined and consistent with the multi-GPU testing framework introduced in this PR.
bionemo-recipes/recipes/esm2_native_te/tests/test_train_two_gpu.py (1)
49-49: LGTM! The @pytest.mark.multi_gpu decorators are correctly applied to all multi-GPU tests in this file, enabling proper test filtering in CI. The decorator placement before @requires_multi_gpu is appropriate.
Also applies to: 68-68, 87-87, 106-106
bionemo-recipes/models/esm2/tests/test_distributed_strategies.py (1)
69-69: LGTM! The @pytest.mark.multi_gpu decorator is correctly placed and enables proper test filtering for this multi-GPU test.
docs/docs/main/contributing/contributing.md (2)
132-138: LGTM! Documentation is comprehensive. The ciflow:multi-gpu label documentation clearly explains:
- When to use it (distributed/multi-GPU code changes)
- How it combines with ciflow:slow for additional coverage
- Default behavior (disabled in PR CI, enabled in nightly/merge queue)
147-147: LGTM! The update to the ciflow:all description correctly reflects that it now includes multi-GPU tests in the comprehensive test suite.
bionemo-recipes/models/esm2/tests/test_distributed_fp8.py (1)
64-64: LGTM! The @pytest.mark.multi_gpu decorator is correctly applied, enabling proper test filtering for this FP8 multi-GPU test.
.github/labels.yml (1)
22-24: LGTM! The ciflow:multi-gpu label is properly defined with a clear description and distinctive color. The label naming follows the established ciflow:* convention.
.github/pull_request_template.md (2)
30-31: LGTM! The PR template updates clearly document:
- The new ciflow:multi-gpu label and its purpose
- How to combine it with ciflow:slow for comprehensive multi-GPU testing
- The expanded scope of ciflow:all to include multi-GPU tests
34-34: LGTM! The default behavior is clearly stated, helping contributors understand that multi-GPU tests require explicit enablement via the ciflow:multi-gpu label.
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_predict.py (3)
83-91: Multi‑GPU parametrizations LGTM. Marker usage and IDs look consistent. Ensure multi_gpu is registered in pytest.ini/pyproject to avoid unknown‑marker warnings.
121-124: A6000 hang workaround: consider broadening NCCL guard. Setting NCCL_P2P_DISABLE=1 is fine. If hangs persist, also consider NCCL_SHM_DISABLE=1 and NCCL_IB_DISABLE=1 as a fallback on these runners.
241-268: Equivalence test markers LGTM. Multi‑GPU gating and parameter coverage look good.
Ensure the multi_gpu marker is included in pyproject.toml (add to markers) to silence warnings. Based on learnings.
.github/workflows/unit-tests-framework.yml (1)
258-271: Runner label warnings in actionlint. These are custom self‑hosted labels. Configure actionlint to know them (actionlint.yaml) or ignore the warnings in CI.
Also applies to: 329-345
.github/workflows/unit-tests-recipes.yml (2)
132-136: Runner label warnings in actionlint. Custom labels are fine; register them in actionlint.yaml to avoid lint noise.
Also applies to: 175-179
219-233: Matrix status check LGTM. Combined status logic is correct.
sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py (2)
621-625: Fixture fallback logic LGTM. Automatic data prep when paths are not provided is convenient.
689-693: Multi‑GPU gating LGTM. Marker aligns with the 2‑GPU requirement.
      - name: Run multi-GPU tests
        working-directory: ${{ matrix.recipe.dir }}
        run: pytest -v -m "multi_gpu" . || exit 0  # Allow no tests found for recipes without multi-GPU tests
Don’t swallow real test failures
pytest ... || exit 0 hides failures. Only ignore exit code 5 (no tests).
- - name: Run multi-GPU tests
+ - name: Run multi-GPU tests
working-directory: ${{ matrix.recipe.dir }}
- run: pytest -v -m "multi_gpu" . || exit 0 # Allow no tests found for recipes without multi-GPU tests
+ run: |
+ set +e
+ pytest -v -m "multi_gpu" .
+ rc=$?
+ if [[ $rc -eq 5 ]]; then
+ echo "No tests found for multi_gpu; continuing."
+ exit 0
+ fi
+ exit $rc📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
      - name: Run multi-GPU tests
        working-directory: ${{ matrix.recipe.dir }}
        run: |
          set +e
          pytest -v -m "multi_gpu" .
          rc=$?
          if [[ $rc -eq 5 ]]; then
            echo "No tests found for multi_gpu; continuing."
            exit 0
          fi
          exit $rc
🤖 Prompt for AI Agents
.github/workflows/unit-tests-recipes.yml around lines 212 to 215: the current
command swallows all failures by unconditionally returning 0; change it to run
pytest, capture its exit code, and only convert exit code 5 (no tests collected)
to 0 while preserving all other exit codes (i.e., run pytest -v -m "multi_gpu"
.; rc=$?; if [ $rc -eq 5 ]; then exit 0; else exit $rc; fi).
@pytest.mark.multi_gpu
@requires_multi_gpu
@pytest.mark.slow
def test_checkpoint_save_and_load_two_processes_mfsdp():
Add unique MASTER_ADDR/PORT to avoid torchrun port collisions (apply to this file’s multi-proc tests).
Great to mark as multi_gpu. However, torchrun defaults (e.g., MASTER_PORT=29500) can collide when multiple tests run concurrently. Set a per-test port in the env dict before subprocess.run, and prefer IPv4.
Example (place after env = os.environ.copy() in this test):
env = os.environ.copy()
env["WANDB_MODE"] = "disabled"
+import socket
+# Use IPv4 loopback and a free ephemeral port
+env.setdefault("MASTER_ADDR", "127.0.0.1")
+if "MASTER_PORT" not in env:
+ with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+ s.bind(("127.0.0.1", 0))
+ env["MASTER_PORT"] = str(s.getsockname()[1])
+# Optional: mitigate A6000 hangs in CI
+if torch.cuda.is_available() and any("A6000" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())):
+ env.setdefault("NCCL_P2P_DISABLE", "1")Repeat for the resume phase and mirror in other 2‑proc tests in this module.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@pytest.mark.multi_gpu
@requires_multi_gpu
@pytest.mark.slow
def test_checkpoint_save_and_load_two_processes_mfsdp():
    env = os.environ.copy()
    env["WANDB_MODE"] = "disabled"
    import socket
    # Use IPv4 loopback and a free ephemeral port
    env.setdefault("MASTER_ADDR", "127.0.0.1")
    if "MASTER_PORT" not in env:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("127.0.0.1", 0))
            env["MASTER_PORT"] = str(s.getsockname()[1])
    # Optional: mitigate A6000 hangs in CI
    if torch.cuda.is_available() and any(
        "A6000" in torch.cuda.get_device_name(i)
        for i in range(torch.cuda.device_count())
    ):
        env.setdefault("NCCL_P2P_DISABLE", "1")
    # ... existing subprocess.run or torchrun invocation using env ...
🤖 Prompt for AI Agents
In
bionemo-recipes/recipes/geneformer_native_te_mfsdp_fp8/test_distributed_checkpointing.py
around lines 137 to 140, the multi-process test does not set explicit
MASTER_ADDR/MASTER_PORT so torchrun can collide with other tests; after the
existing env = os.environ.copy() in this test set env["MASTER_ADDR"]="127.0.0.1"
and env["MASTER_PORT"]=str(<unique_port>) (pick a test-specific free port or
compute base+pid) before each subprocess.run call (both the initial run and the
resume phase), and apply the same change to the other two-process tests in this
file so each subprocess invocation uses IPv4 and a unique port.
    subprocess.run(["zcat", "chr20.fa.gz"], stdout=open(test_dir / "chr20.fa", "w"), cwd=test_dir, check=True)
    subprocess.run(["zcat", "chr21.fa.gz"], stdout=open(test_dir / "chr21.fa", "w"), cwd=test_dir, check=True)
    subprocess.run(["zcat", "chr22.fa.gz"], stdout=open(test_dir / "chr22.fa", "w"), cwd=test_dir, check=True)

    # Concatenate files
    subprocess.run(
        ["cat", "chr20.fa", "chr21.fa", "chr22.fa"], stdout=open(concat_path, "w"), cwd=test_dir, check=True
    )
Close file descriptors and add timeouts
zcat/cat calls open files without closing; add context managers and timeouts.
-    subprocess.run(["zcat", "chr20.fa.gz"], stdout=open(test_dir / "chr20.fa", "w"), cwd=test_dir, check=True)
-    subprocess.run(["zcat", "chr21.fa.gz"], stdout=open(test_dir / "chr21.fa", "w"), cwd=test_dir, check=True)
-    subprocess.run(["zcat", "chr22.fa.gz"], stdout=open(test_dir / "chr22.fa", "w"), cwd=test_dir, check=True)
+    with open(test_dir / "chr20.fa", "w") as out20:
+        subprocess.run(["zcat", "chr20.fa.gz"], stdout=out20, cwd=test_dir, check=True, timeout=600)
+    with open(test_dir / "chr21.fa", "w") as out21:
+        subprocess.run(["zcat", "chr21.fa.gz"], stdout=out21, cwd=test_dir, check=True, timeout=600)
+    with open(test_dir / "chr22.fa", "w") as out22:
+        subprocess.run(["zcat", "chr22.fa.gz"], stdout=out22, cwd=test_dir, check=True, timeout=600)
@@
-    subprocess.run(
-        ["cat", "chr20.fa", "chr21.fa", "chr22.fa"], stdout=open(concat_path, "w"), cwd=test_dir, check=True
-    )
+    with open(concat_path, "w") as outcat:
+        subprocess.run(["cat", "chr20.fa", "chr21.fa", "chr22.fa"], stdout=outcat, cwd=test_dir, check=True, timeout=600)
🤖 Prompt for AI Agents
In sub-packages/bionemo-evo2/tests/bionemo/evo2/run/test_train.py around lines
536 to 543, subprocess.run calls open() for stdout without closing file
descriptors and lack timeouts; update each subprocess.run to open the target
file with a context manager (with open(path, "w") as f:) and pass the file
object as stdout, and add a reasonable timeout argument (e.g. timeout=30) to
each subprocess.run so files are always closed even on errors and the calls
cannot hang.
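Since the same open/run pattern appears for all three chromosomes, a tiny wrapper is another option — a sketch only; zcat_to is a hypothetical helper, not code from test_train.py:

import subprocess
from pathlib import Path


def zcat_to(src_name: str, dest: Path, cwd: Path, timeout: int = 600) -> None:
    """Decompress cwd/src_name into dest, closing the output file and bounding runtime."""
    with open(dest, "w") as out:
        subprocess.run(["zcat", src_name], stdout=out, cwd=cwd, check=True, timeout=timeout)

The three calls above would then collapse to zcat_to("chr20.fa.gz", test_dir / "chr20.fa", test_dir) and so on, with the cat step handled the same way.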
  unit-tests-multi-gpu:
    needs: changed-dirs
    runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
    if: ${{ needs.changed-dirs.outputs.dirs != '[]' }}
    name: "unit-tests-multi-gpu (${{ matrix.recipe.name }})"
we don't want these to be running on every PR right? Are we confident enough in those nemo A6000 nodes to hold up critical PRs? Are they aware of the incoming usage?
let's make this only run on schedule events, and only if the single-GPU tests pass
done
i don't see where this is gated by schedule?
Signed-off-by: Gagan Kaushik <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
.github/workflows/unit-tests-framework.yml (1)
37-49: Add minimal GITHUB_TOKEN permissions at workflow level. Silences CodeQL and follows least‑privilege.

 name: "BioNeMo Framework CI"
+
+permissions:
+  contents: read
+  pull-requests: read
+
 on:
🧹 Nitpick comments (4)
docs/docs/main/contributing/contributing.md (1)
144-145: Clarify multi‑GPU behavior and add a dedicated label section

Good note. Suggest making PR usage explicit and adding a ciflow:multi-gpu subsection for discoverability.

Apply this diff to clarify the note:

-  - Note: Multi-GPU tests always run separately in nightly builds and are not affected by this label
+  - Note: Multi-GPU tests run separately in nightly builds and are not affected by this label. To run them in PR CI, add the `ciflow:multi-gpu` label.

And add a dedicated section for the label (placed after ciflow:all):

+#### **ciflow:multi-gpu**
+
+- Triggers multi-GPU tests (2× RTX A6000 runners) on PR CI.
+- Disabled by default; they run automatically in the merge queue and nightly regardless of `ciflow:all`.
+- Use when modifying distributed or multi-GPU code paths.
+- Mark tests with `@pytest.mark.multi_gpu`.

Please verify this matches the actual workflow/job behavior introduced in this PR (job names, hardware type, and when they're scheduled).
.github/workflows/unit-tests-framework.yml (3)
258-258: Unknown runner label; verify self‑hosted configuration or add 'self‑hosted' to runs‑on.

actionlint flags linux-amd64-gpu-rtxa6000-latest-2-nemo. If these are self-hosted, prefer:

  runs-on: [self-hosted, linux, x64, your-custom-label]

or configure actionlint to allow custom labels.
Also applies to: 325-325
228-237: Remove redundant chmod of run_pytest_unittests.sh. You've switched to pytest_runner.sh; drop the unused chmod.

       run: |
-        chmod +x ./ci/scripts/run_pytest_unittests.sh
         chmod +x ./ci/scripts/pytest_runner.sh

Apply to single-GPU and multi-GPU jobs.
Also applies to: 272-279, 345-346
254-291: Add timeouts to prevent hung multi‑GPU runs. Past MOCO hangs suggest setting job timeouts.

   run-tests-multi-gpu:
+    timeout-minutes: 120
@@
   run-tests-slow-multi-gpu:
+    timeout-minutes: 180

Also applies to: 321-347
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .github/labels.yml (1 hunks)
- .github/pull_request_template.md (1 hunks)
- .github/workflows/unit-tests-framework.yml (6 hunks)
- docs/docs/main/contributing/contributing.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/pull_request_template.md
- .github/labels.yml
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
325-325: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-292: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (2)
.github/workflows/unit-tests-framework.yml (2)
280-290: Codecov correctly disabled for multi‑GPU jobs. Conditions exclude schedule/merge_group, so no uploads from multi‑GPU runs. Matches prior feedback.

314-320: Flags for slow single‑GPU look right. --only-slow with --skip-multi-gpu is correct; allow-no-tests avoids false failures.
  run-tests-multi-gpu:
    needs:
      - build-bionemo-image
      - run-tests-single-gpu
    runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
    container:
      image: svcbionemo023/bionemo-framework:${{ github.run_id }}
      credentials:
        username: ${{ vars.DOCKER_USERNAME }}
        password: ${{ secrets.DOCKER_PASSWORD }}
    if: |
      (needs.build-bionemo-image.result == 'success') &&
      (needs.run-tests-single-gpu.result == 'success') &&
      (github.event_name == 'schedule')
    steps:
Multi‑GPU job never runs on PR label or merge queue; schedule‑only contradicts PR objectives.
Enable ciflow:multi-gpu and merge_group triggers and wire in get-pr-labels.
Apply:
run-tests-multi-gpu:
- needs:
- - build-bionemo-image
- - run-tests-single-gpu
+ needs:
+ - build-bionemo-image
+ - run-tests-single-gpu
+ - get-pr-labels
@@
- if: |
- (needs.build-bionemo-image.result == 'success') &&
- (needs.run-tests-single-gpu.result == 'success') &&
- (github.event_name == 'schedule')
+ if: |
+ (needs.build-bionemo-image.result == 'success') &&
+ (needs.run-tests-single-gpu.result == 'success') &&
+ (
+ (github.event_name == 'schedule') ||
+ (github.event_name == 'merge_group') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+ )

🧰 Tools
🪛 actionlint (1.7.8)
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
  run-tests-slow-multi-gpu:
    needs:
      - build-bionemo-image
      - run-tests-slow-single-gpu
    runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo
    container:
      image: svcbionemo023/bionemo-framework:${{ github.run_id }}
      credentials:
        username: ${{ vars.DOCKER_USERNAME }}
        password: ${{ secrets.DOCKER_PASSWORD }}
    if: |
      (needs.build-bionemo-image.result == 'success') &&
      (needs.run-tests-slow-single-gpu.result == 'success') &&
      (github.event_name == 'schedule')
    steps:
Slow multi‑GPU job also schedule‑only; allow PR label and merge queue.
Mirror fast job gating for consistency with usage docs.
run-tests-slow-multi-gpu:
- needs:
- - build-bionemo-image
- - run-tests-slow-single-gpu
+ needs:
+ - build-bionemo-image
+ - run-tests-slow-single-gpu
+ - get-pr-labels
@@
- if: |
- (needs.build-bionemo-image.result == 'success') &&
- (needs.run-tests-slow-single-gpu.result == 'success') &&
- (github.event_name == 'schedule')
+ if: |
+ (needs.build-bionemo-image.result == 'success') &&
+ (needs.run-tests-slow-single-gpu.result == 'success') &&
+ (
+ (github.event_name == 'schedule') ||
+ (github.event_name == 'merge_group') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+ contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+ )
🧰 Tools
🪛 actionlint (1.7.8)
325-325: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🤖 Prompt for AI Agents
.github/workflows/unit-tests-framework.yml lines 321-335: the
run-tests-slow-multi-gpu job is currently restricted to schedule-only by
checking github.event_name == 'schedule'; update its if expression to mirror the
fast job gating so it also runs when a PR is labeled or triggered via the merge
queue as well as on schedule. Replace the current if block with the same
conditional used by the fast job (allowing schedule OR pull_request with the
specific label OR merge-queue triggers) so behavior matches the fast job and
project docs.
Signed-off-by: Gagan Kaushik <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (3)
.github/workflows/unit-tests-framework.yml (3)
37-57: Add least‑privilege GITHUB_TOKEN permissions (and job‑level PR read for labels). Set minimal top‑level permissions and grant pull‑requests: read to get-pr-labels.
Based on static analysis hints
 name: "BioNeMo Framework CI"
 on:
   push:
@@
   schedule:
     - cron: "0 7 * * *" # Runs at 7 AM UTC daily (12 AM MST)
+permissions:
+  contents: read
+
 defaults:
   run:
     shell: bash -x -e -u -o pipefail {0}

And for the labels job:

   get-pr-labels:
+    permissions:
+      contents: read
+      pull-requests: read
     runs-on: ubuntu-latest
254-268: Multi‑GPU job is schedule‑only; enable PR label and merge queue per usage docs. Wire in get-pr-labels and expand the if to allow ciflow:multi-gpu, ciflow:all, merge_group, and schedule.

   run-tests-multi-gpu:
-    needs:
-      - build-bionemo-image
-      - run-tests-single-gpu
+    needs:
+      - build-bionemo-image
+      - run-tests-single-gpu
+      - get-pr-labels
@@
-    if: |
-      (needs.build-bionemo-image.result == 'success') &&
-      (needs.run-tests-single-gpu.result == 'success') &&
-      (github.event_name == 'schedule')
+    if: |
+      (needs.build-bionemo-image.result == 'success') &&
+      (needs.run-tests-single-gpu.result == 'success') &&
+      (
+        (github.event_name == 'schedule') ||
+        (github.event_name == 'merge_group') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+      )
309-323: Slow multi‑GPU job is also schedule‑only; mirror fast job gating. Allow ciflow:multi-gpu/ciflow:all labels and merge queue too.

   run-tests-slow-multi-gpu:
-    needs:
-      - build-bionemo-image
-      - run-tests-slow-single-gpu
+    needs:
+      - build-bionemo-image
+      - run-tests-slow-single-gpu
+      - get-pr-labels
@@
-    if: |
-      (needs.build-bionemo-image.result == 'success') &&
-      (needs.run-tests-slow-single-gpu.result == 'success') &&
-      (github.event_name == 'schedule')
+    if: |
+      (needs.build-bionemo-image.result == 'success') &&
+      (needs.run-tests-slow-single-gpu.result == 'success') &&
+      (
+        (github.event_name == 'schedule') ||
+        (github.event_name == 'merge_group') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') ||
+        contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu')
+      )
🧹 Nitpick comments (3)
.github/workflows/unit-tests-framework.yml (3)
228-237: Avoid chmod on unused wrapper or use it consistently. You chmod run_pytest_unittests.sh but don't use it. Either remove the chmod, or call the wrapper as earlier suggested by the reviewer.

Option A (remove unused chmod):

-          chmod +x ./ci/scripts/run_pytest_unittests.sh
           chmod +x ./ci/scripts/pytest_runner.sh
           ./ci/scripts/pytest_runner.sh --no-nbval --skip-slow --skip-multi-gpu

Option B (use wrapper and add skip flag, aligning with prior guidance):

-          chmod +x ./ci/scripts/pytest_runner.sh
-          ./ci/scripts/pytest_runner.sh --no-nbval --skip-slow --skip-multi-gpu
+          chmod +x ./ci/scripts/run_pytest_unittests.sh
+          ./ci/scripts/run_pytest_unittests.sh --skip-multi-gpu
22-27: Keep the workflow comments in sync with gating. After enabling label/merge‑queue triggers for multi‑GPU jobs, update these "nightly only" notes.
Also applies to: 31-35
367-392: Optional: track queue time and success rates. Emit queue/runtime metrics to the job summary via GH API.
Example step:
      - name: Queue/runtime metrics
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          RUN_JSON=$(gh api repos/${{ github.repository }}/actions/runs/${{ github.run_id }})
          CREATED=$(echo "$RUN_JSON" | jq -r '.created_at')
          STARTED=$(echo "$RUN_JSON" | jq -r '.run_started_at')
          NOW=$(date -u +%FT%TZ)
          QUEUE_SEC=$(python - <<PY
          from datetime import datetime, timezone
          from dateutil import parser
          c=parser.isoparse("$CREATED"); s=parser.isoparse("$STARTED")
          print(int((s-c).total_seconds()))
          PY
          )
          echo "Queue time (s): $QUEUE_SEC" >> $GITHUB_STEP_SUMMARY
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- .github/workflows/unit-tests-framework.yml (6 hunks)
🧰 Additional context used
🪛 actionlint (1.7.8)
.github/workflows/unit-tests-framework.yml
258-258: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
313-313: label "linux-amd64-gpu-rtxa6000-latest-2-nemo" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2025", "windows-2022", "windows-11-arm", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-24.04-arm", "ubuntu-22.04", "ubuntu-22.04-arm", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-26-xlarge", "macos-26", "macos-15-intel", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🪛 GitHub Check: CodeQL
.github/workflows/unit-tests-framework.yml
[warning] 255-280: Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {{contents: read}}
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (1)
.github/workflows/unit-tests-framework.yml (1)
258-258: actionlint: declare custom runner labels to silence false positives. Add an actionlint config listing your self‑hosted labels.
Based on static analysis hints
Create .github/actionlint.yaml:
runner-label:
  - self-hosted
  - linux-amd64-cpu16
  - linux-amd64-gpu-l4-latest-1
  - linux-amd64-gpu-rtxa6000-latest-2-nemo

Also applies to: 313-313
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test aa64021
    if: |
      (needs.build-bionemo-image.result == 'success') &&
      (needs.run-tests-single-gpu.result == 'success') &&
      (github.event_name == 'schedule')
we should run either on the schedule event or the PR label?
i thought we wanted to only have it run nightly? or is it nightly + label but no merge queue for now?
Signed-off-by: Gagan Kaushik <[email protected]>
…ests pass OR on PRs with the ciflow:multi-gpu label Signed-off-by: Gagan Kaushik <[email protected]>
    runs-on: ubuntu-latest
    outputs:
      labels: ${{ steps.get-labels.outputs.labels || steps.get-labels-empty.outputs.labels }}
    steps:
      - name: Get PR number from branch
        if: startsWith(github.ref, 'refs/heads/pull-request/')
        id: get-pr-num
        run: |
          PR_NUM=$(echo ${{ github.ref_name }} | grep -oE '[0-9]+$')
          echo "pr_num=$PR_NUM" >> $GITHUB_OUTPUT
      - name: Get PR labels
        id: get-labels
        if: startsWith(github.ref, 'refs/heads/pull-request/')
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          LABELS=$(gh api repos/${{ github.repository }}/pulls/${{ steps.get-pr-num.outputs.pr_num }} --jq '[.labels[].name]' || echo "[]")
          echo "labels=$LABELS" >> $GITHUB_OUTPUT
          echo "Retrieved labels: $LABELS"
      - name: Set empty labels for non-PR branches
        if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
        id: get-labels-empty
        run: |
          echo "labels=[]" >> $GITHUB_OUTPUT
          echo "Set empty labels for non-PR branch"
  unit-tests-multi-gpu:
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Copilot Autofix
The recommended fix is to explicitly specify least privilege permissions for the workflow or, ideally, for the individual jobs in .github/workflows/unit-tests-recipes.yml.
- For the get-pr-labels job, the only required permission is to read pull request metadata (and potentially repository contents if any fetch occurs).
- pull-requests: read and optionally contents: read cover reading PR info and repo access.
- Add a top-level permissions block after name: (applies to all jobs), or alternatively, apply a narrower permissions block to each job. For simplicity and following the error's suggestion, add at the top/workflow level.
- Edit the file .github/workflows/unit-tests-recipes.yml, adding:

      permissions:
        contents: read
        pull-requests: read

  directly after the name: key and before events (on:).
@@ -22,6 +22,9 @@
 # Note: On push, multi-GPU tests run in parallel with single-GPU tests (no dependency)

 name: "BioNeMo Recipes CI"
+permissions:
+  contents: read
+  pull-requests: read

 on:
   push:
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 2fe65dc
| needs: | ||
| - build-bionemo-image | ||
| - run-tests-single-gpu | ||
| - get-pr-labels | ||
| runs-on: linux-amd64-gpu-rtxa6000-latest-2-nemo | ||
| container: | ||
| image: svcbionemo023/bionemo-framework:${{ github.run_id }} | ||
| credentials: | ||
| username: ${{ vars.DOCKER_USERNAME }} | ||
| password: ${{ secrets.DOCKER_PASSWORD }} | ||
| # Run multi-GPU tests ONLY when: | ||
| # Prerequisites: build succeeds AND single-GPU tests pass | ||
| # Then run if: schedule OR (push with ciflow:all OR ciflow:multi-gpu label) | ||
| # Do NOT run on merge_group or any other events | ||
| if: | | ||
| (needs.build-bionemo-image.result == 'success') && | ||
| (needs.run-tests-single-gpu.result == 'success') && | ||
| ( | ||
| github.event_name == 'schedule' || | ||
| ( | ||
| github.event_name == 'push' && | ||
| ( | ||
| contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:all') || | ||
| contains(fromJSON(needs.get-pr-labels.outputs.labels || '[]'), 'ciflow:multi-gpu') | ||
| ) | ||
| ) | ||
| ) | ||
| steps: | ||
| - name: Checkout repository | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Run multi-GPU tests | ||
| env: | ||
| BIONEMO_DATA_SOURCE: ngc | ||
| run: | | ||
| chmod +x ./ci/scripts/run_pytest_multigpu.sh | ||
| ./ci/scripts/run_pytest_multigpu.sh | ||
| run-tests-slow-single-gpu: |
Check warning
Code scanning / CodeQL
Workflow does not contain permissions Medium
Copilot Autofix
To fix the problem, we need to add the permissions key at the workflow level (top-level, after the workflow name and before the first trigger under on:). This will instruct GitHub to issue a token with only the specified minimal permissions for all jobs, unless a particular job specifies a more permissive permissions: block.
The best single edit is to add the following block:
permissions:
  contents: read

after the workflow name and before the on: block, i.e., after line 46 and before line 48. This limits all jobs in this workflow (unless individually overridden) to only read repository contents using the automatic GITHUB_TOKEN. No imports or further changes are required, as only the workflow file is affected.
@@ -45,6 +45,9 @@

 name: "BioNeMo Framework CI"

+permissions:
+  contents: read
+
 on:
   push:
     branches:
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 3bf090d
…plify.py Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 5f17689
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test f14497c
Signed-off-by: Gagan Kaushik <[email protected]>
Signed-off-by: Gagan Kaushik <[email protected]>
/ok to test 91e4dcd
Description
Usage
Simply add @pytest.mark.multi_gpu to any future multi-gpu tests. Use the new ciflow:multi-gpu label to run them in PRs. Otherwise, they will run in the merge queue and on the nightly schedule.
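For example, a new multi-GPU test might look like this (a minimal sketch; the skip guard and the test body are illustrative and not taken from the repository):

import pytest
import torch

# Hypothetical guard so the test is skipped on machines with fewer than two GPUs.
requires_two_gpus = pytest.mark.skipif(
    torch.cuda.device_count() < 2, reason="requires at least 2 GPUs"
)


@pytest.mark.multi_gpu
@requires_two_gpus
def test_two_gpus_visible():
    # Placeholder body; a real test would typically launch torchrun subprocesses
    # that exercise both devices.
    assert torch.cuda.device_count() >= 2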
Type of changes
Pre-submit Checklist
Summary by CodeRabbit
New Features
Documentation
CI
Tests
Bug Fixes