Skip to content

Conversation

@terrykong
Copy link
Contributor

@terrykong terrykong commented Nov 9, 2025

Closes #1480

Dependency Management & Container Fingerprinting

Overview

This PR allows users to use bare python for the driver script and exposes all the other venvs and alias them in the PATH with the names python-$ray_ractor_name. This PR also adds container fingerprinting to detect and handle version mismatches between container environments and local code.

Key Changes

1. Container Fingerprinting System

New files:

  • tools/generate_fingerprint.py: Computes MD5 hashes of pyproject.toml, uv.lock, and git submodule SHAs to create a unique fingerprint
  • nemo_rl/__init__.py: Enhanced with _check_container_fingerprint() that validates container/code alignment on startup

How it works:

  • Containers now include a fingerprint file at /opt/nemo_rl_container_fingerprint generated at build time
  • On startup (when NRL_CONTAINER=1), NeMo RL compares the container's fingerprint against the current code
  • If mismatches are detected, users receive a detailed warning showing exactly what changed (dependencies, submodules, etc.)
  • Container-only feature: Fingerprinting is only available in containers where we can ensure the fingerprint file is created at build time. These checks are automatically skipped when running outside a container

Environment variables:

  • NRL_IGNORE_VERSION_MISMATCH=1: Bypass the version check (not recommended)
  • NRL_FORCE_REBUILD_VENVS=true: Automatically skip the check and rebuild venvs with current dependencies (existing env var)
  • NRL_CONTAINER=1: Indicates running inside a container (set automatically in Dockerfile)

2. Frozen Environment Support

New features:

  • tools/list_editable_packages.py: Utility to discover which packages can be mounted for development
  • nemo_rl/utils/prefetch_venvs.py: Now creates convenience symlinks like python-MegatronPolicyWorker, python-VllmGenerationWorker, etc.

Benefits:

  • Users can directly access environment-specific Python executables without uv run
  • Supports manual package installation via python-MegatronPolicyWorker -m pip install <pkg> for quick experimentation
  • All convenience executables available in /usr/local/bin/python-*
  • Running bare python (instead of uv run) skips network checks that uv performs, providing faster startup in air-gapped or network-constrained environments

3. Comprehensive Documentation

New docs:

  • docs/design-docs/dependency-management.md:
    • Production vs development workflows
    • How uv run works under the hood
    • Multi-environment architecture (policy, generation, environment workers)
    • Container pre-caching mechanisms
    • Decision flowcharts and mermaid diagrams
    • When to rebuild environments vs containers

Docs enhancements:

  • Added sphinxcontrib-mermaid support for diagrams in documentation
  • Updated docs/conf.py to enable mermaid code blocks

4. Docker Improvements

Dockerfile changes:

  • Added ENV NRL_CONTAINER=1 to indicate container environment
  • bug fix for custom vllm installs:
    • Copy nemo_rl/__init__.py and package_info.py earlier in build for build-custom-vllm.sh
    • Add uv venv --seed before conditional vLLM build
  • Generate and store container fingerprint at /opt/nemo_rl_container_fingerprint

5. Test Coverage

New tests (tests/unit/test_version_check.py):

  • Build isolation detection tests (validates fingerprint checks skip during uv lock/sync)
  • Fingerprint check tests for match/mismatch scenarios
  • Environment variable override tests
  • Integration tests with real uv sync commands

6. Miscellaneous

  • Added pip to core dependencies for frozen environment support (since uv venv --seed doesn't always add it)
  • Fixed torch backend in build-custom-vllm.sh (cu128 → cu129)
  • Improved logging format to show file location and line numbers for all log messages (default level: WARNING and above)

Workflows Supported

Production Workflow

  • Use pre-built containers with pre-cached environments
  • No rebuilding needed if code matches container
  • Fast startup times

Development Workflow

  • Set NRL_FORCE_REBUILD_VENVS=true to rebuild venvs on demand
  • Works on single-node and multi-node setups
  • Not intended for large scale runs

Frozen Environment Workflow

  • Direct access to Python executables without uv run
  • Manual dependency management with pip
  • Useful for experimentation (but changes must be upstreamed via pyproject.toml)
  • Useful if you need bare python which in very high node scales can be convenient.

Migration Notes

  • Existing workflows continue to work without changes
  • Version mismatch warnings are informative and provide clear resolution steps
  • Users can bypass checks if needed (though not recommended)

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 9, 2025
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 10, 2025
@terrykong
Copy link
Contributor Author

@CodeRabbit summary

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 10, 2025

✅ Actions performed

Summary regeneration triggered.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 10, 2025

📝 Walkthrough

Walkthrough

Adds support for container "frozen" environments: container fingerprint generation and import-time fingerprint validation, venv wrapper scripts for direct python-{Worker} invocation, documentation and tests for dependency management and frozen-env workflows, Docker build adjustments, and related tooling utilities.

Changes

Cohort / File(s) Summary
Container build & fingerprinting
docker/Dockerfile, .github/workflows/cicd-main.yml
Sets ENV NRL_CONTAINER=1, seeds uv venv in image, conditionally runs tools/build-custom-vllm.sh via BUILD_CUSTOM_VLLM, writes container fingerprint via python tools/generate_fingerprint.py to /opt/nemo_rl_container_fingerprint, and passes build-args MAX_JOBS and NEMO_RL_COMMIT in CICD.
Fingerprint generator
tools/generate_fingerprint.py
New stdlib-only script computing MD5s for pyproject.toml, uv.lock and Git submodule SHAs; emits sorted JSON fingerprint to stdout; new public functions: get_repo_root, compute_file_hash, get_submodule_shas, generate_fingerprint, main.
Runtime version checking
nemo_rl/__init__.py
Adds import-time logic to detect build isolation, run container-only fingerprint comparison against /opt/nemo_rl_container_fingerprint, warn or raise on mismatches (controlled by NRL_IGNORE_VERSION_MISMATCH), and exposes helper functions _is_build_isolation() and _check_container_fingerprint().
Frozen env wrapper creation
nemo_rl/utils/prefetch_venvs.py
Adds create_frozen_environment_symlinks(venv_configs) to create /usr/local/bin/python-{ClassName} wrapper scripts that set VIRTUAL_ENV/PATH and exec the venv Python when NRL_CONTAINER is set; invoked after venv prefetch.
Dependency metadata & build scripts
pyproject.toml, tools/build-custom-vllm.sh
Adds top-level dependency pip (seed reliability) and updates Torch wheel CUDA backend from cu128 to cu129 in custom vLLM build script.
Design docs & Sphinx
docs/design-docs/dependency-management.md, docs/conf.py, docs/index.md
Adds a dependency-management design doc and registers it in the Design Docs toctree; enables MyST directive handling for mermaid fenced blocks in docs/conf.py.
Tests — functional
tests/functional/test_frozen_env.sh, tests/functional/grpo_frozen_env.sh, tests/functional/L1_Functional_Tests_GPU.sh
Adds test_frozen_env.sh (three-part validation of wrapper executables, mutation detection, import isolation), grpo_frozen_env.sh (GRPO run under frozen env), and expands L1 GPU suite to run frozen-env tests and additional workflows.
Tests — unit
tests/unit/test_version_check.py
New unit/integration-style tests exercising fingerprint check paths, mismatch handling, build isolation detection, and force-rebuild behavior; extensive mocking and some skipped integration tests.
Utilities
tools/list_editable_packages.py
New utility to enumerate editable pip packages per discovered python* executables on PATH (helpers: find_python_executables, get_editable_packages, main).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant nemo as nemo_rl.__init__
    participant gen as tools/generate_fingerprint.py
    participant FS as Container FS

    rect rgb(220,235,255)
      Note over User,nemo: Import-time fingerprint validation (container-only)
      User->>nemo: import nemo_rl
      alt NRL_CONTAINER set and not build-isolation and not NRL_FORCE_REBUILD_VENVS
        nemo->>gen: run generate_fingerprint.py
        gen->>gen: compute MD5(pyproject.toml, uv.lock) & submodule SHAs
        gen-->>nemo: fingerprint dict
        nemo->>FS: read /opt/nemo_rl_container_fingerprint
        FS-->>nemo: container fingerprint
        nemo->>nemo: compare fingerprints
        alt mismatch & not NRL_IGNORE_VERSION_MISMATCH
          nemo->>User: raise RuntimeError (stop import)
        else mismatch & NRL_IGNORE_VERSION_MISMATCH
          nemo->>User: log warning, continue
        else match
          nemo->>User: import completes
        end
      else skip checks
        nemo->>User: import completes
      end
    end
Loading
sequenceDiagram
    participant Caller
    participant Prefetch as prefetch_venvs.create
    participant FS as /usr/local/bin

    rect rgb(235,250,230)
      Note over Caller,FS: Frozen environment wrapper creation
      Caller->>Prefetch: prefetch_venvs() completes venvs
      alt NRL_CONTAINER set
        Prefetch->>Prefetch: create_frozen_environment_symlinks(venv_configs)
        loop per venv config
          Prefetch->>FS: write python-{ClassName} wrapper script
          FS->>FS: set VIRTUAL_ENV, update PATH, exec venv python
        end
        Prefetch->>Caller: log summary of created wrappers
      else skip creation
        Prefetch->>Caller: skip
      end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30–45 minutes

  • Pay extra attention to:
    • nemo_rl/init.py — import-time side effects, environment variable gating, and exception semantics.
    • tools/generate_fingerprint.py — correctness of file hashing, submodule SHA extraction, and stable JSON ordering.
    • nemo_rl/utils/prefetch_venvs.py — wrapper script content, permissions, collision detection, and PATH/VIRTUAL_ENV correctness.
    • Tests in tests/unit/test_version_check.py — mocked scenarios and skipped integration tests that may hide brittle assumptions.
    • Dockerfile build ordering for seed, fingerprint generation, and conditional vLLM build.

Possibly related PRs

Suggested labels

CI:L1, Run CICD

Suggested reviewers

  • hemildesai
  • chtruong814
  • parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: allow uv-less execution and fingerprint the environment' directly summarizes the two main changes: enabling uv-less execution and adding environment fingerprinting for version/drift detection.
Linked Issues check ✅ Passed All primary objectives from #1480 are met: uv-less execution via per-venv python wrappers [1480], frozen environments with seeded venvs [1480], container drift detection via MD5/submodule fingerprints [1480], and environment variable overrides for flexibility [1480].
Out of Scope Changes check ✅ Passed All changes align with #1480 objectives; minor supporting changes include torch backend update, pip dependency addition, documentation, and test infrastructure—all necessary for frozen environment support.
Test Results For Major Changes ✅ Passed PR includes 460 lines of unit tests for fingerprint verification and 258 lines of functional tests validating frozen environment behavior with metric validation in CI pipeline.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/uv-less

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

  • Provide your own instructions using the high_level_summary_instructions setting.
  • Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
  • Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

  1. 📝 Description — Summarize the main change in 50–60 words, explaining what was done.
  2. 📓 References — List relevant issues, discussions, documentation, or related PRs.
  3. 📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.
  4. 📊 Contributor Summary — Include a Markdown table showing contributions:
    | Contributor | Lines Added | Lines Removed | Files Changed |
  5. ✔️ Additional Notes — Add any extra reviewer context.
    Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 10, 2025
@github-actions github-actions bot added the CI Relating to CI label Nov 11, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 11, 2025
Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

seed

Signed-off-by: Terry Kong <[email protected]>

init

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

fix

Signed-off-by: Terry Kong <[email protected]>

fix symlink creation to use exec as opposed to symlink which doesn't work

Signed-off-by: Terry Kong <[email protected]>

frozen grpo

Signed-off-by: Terry Kong <[email protected]>

use a json fingerprint

Signed-off-by: Terry Kong <[email protected]>

logging level

Signed-off-by: Terry Kong <[email protected]>

docs update

Signed-off-by: Terry Kong <[email protected]>

fix up

Signed-off-by: Terry Kong <[email protected]>

more functional test

Signed-off-by: Terry Kong <[email protected]>

mermaid

Signed-off-by: Terry Kong <[email protected]>

nrl-force rebuild and build-v will skip fingerprint check

Signed-off-by: Terry Kong <[email protected]>

fix tests

Signed-off-by: Terry Kong <[email protected]>

explain the local development

Signed-off-by: Terry Kong <[email protected]>

fix ci container

Signed-off-by: Terry Kong <[email protected]>

safe.directory fix

Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
@terrykong terrykong marked this pull request as ready for review November 18, 2025 07:03
@terrykong terrykong requested review from a team as code owners November 18, 2025 07:03
Signed-off-by: Terry Kong <[email protected]>
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 18, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (15)
pyproject.toml (1)

18-21: Adding pip as a core dependency is reasonable for frozen envs

Including pip in the base dependencies matches the frozen-environment workflow where uv venv --seed may not always install it; behavior-wise this looks fine. If you start seeing churn from future pip releases, you might later choose to version-pin for reproducibility, but it’s not required now.

nemo_rl/__init__.py (3)

19-23: Avoid configuring global logging at import time

Calling logging.basicConfig(...) in a library’s __init__ affects the root logger formatting/level for any embedding application that hasn’t configured logging yet. That’s surprising behavior for consumers.

Prefer configuring only this package’s logger (e.g., logger = logging.getLogger(__name__) with a handler) or restricting this setup to CLI entry points instead of import-time.


58-70: Build-isolation detection heuristic is brittle but probably sufficient

The _is_build_isolation() check relies on "/builds-v" being present in sys.prefix. If uv ever changes its cache layout, this could silently stop skipping fingerprint checks in build isolation.

If you want this more robust, consider tightening the pattern (e.g., checking for /.cache/uv/builds-v prefix) and/or adding a small test that asserts the behavior against a known uv build path.


72-205: Fingerprint check is solid overall; a few robustness and ergonomics tweaks

The container/code drift check is well structured and guarded by NRL_CONTAINER, but a few details could be improved:

  • Environment flag parsing (Lines 86-89): NRL_FORCE_REBUILD_VENVS only skips checks when exactly "true" (case-insensitive lower). Users may reasonably set "1" or "True". Consider normalizing to accept a small set of truthy values ({"1","true","yes","on"}) to avoid surprises.
  • Using runpy + sys.stdout hijack (Lines 118-125): Temporarily reassigning sys.stdout is a global side effect and could be racy if imports happen in parallel. Since tools/generate_fingerprint.py exposes generate_fingerprint(), it’d be cleaner to import and call that function directly via importlib/runpy.run_path (using the returned globals) instead of driving the script via stdout.
  • Swallowing unexpected failures quietly (Lines 199-205): Any non–RuntimeError gets reduced to a DEBUG log message, which will be hidden under the default WARNING-level config. If the fingerprint mechanism fails (e.g., JSON decode error, script crash, unreadable /opt/... file), users may never notice that the guardrail is effectively disabled. Consider logging these as logging.warning (or at least info) with a short message that the version check was skipped due to an internal error.

The mismatch handling / message content look good and align with the design doc; just these small changes would make the behavior more predictable in production.

docs/design-docs/dependency-management.md (3)

205-221: Decision-guide flowchart seems inverted vs described workflows

In the “Decision Guide” mermaid diagram, node C (“Container built from same commit as code?”) sends Yes to:

F["✓ Run with
NRL_FORCE_REBUILD_VENVS=true uv run examples/..."]
Loading

and No to D (small scale or testing). Based on the earlier text, it seems the expected behavior is:

  • If the container and code are from the same commit, you should just run normally (uv run ...), no NRL_FORCE_REBUILD_VENVS needed.
  • If they differ, you go into the development workflow branch (deciding between NRL_FORCE_REBUILD_VENVS=true and rebuilding the container).

Consider swapping the E/F nodes or adjusting edges so the flowchart matches the narrative and the actual recommended workflows.


271-311: Align documented “Solutions” list with the actual runtime warning

The “Container Version Checking” example lists three solutions (rebuild container, set NRL_FORCE_REBUILD_VENVS=true, set NRL_IGNORE_VERSION_MISMATCH=1), whereas the runtime warning in nemo_rl/__init__.py offers an additional option to regenerate /opt/nemo_rl_container_fingerprint with tools/generate_fingerprint.py.

For clarity, consider either:

  • Adding the “regenerate fingerprint” option to the doc snippet, or
  • Removing it from the runtime warning if you don’t want to encourage that path for most users.

Keeping these consistent will make it easier for users to reconcile the docs with what they see at runtime.


234-325: Minor markdownlint nits around blockquotes

Static analysis flagged blank lines inside blockquotes (MD028) in a few places (e.g., the warnings/caution callouts around frozen environments). Not urgent, but trimming the blank lines inside those > sections will keep markdownlint happy and avoid noisy CI failures if you enforce docs linting.

tests/functional/test_frozen_env.sh (1)

18-41: Slightly brittle ls-based discovery of python- executables*

PYTHON_EXECUTABLES=($(ls -1 /usr/local/bin/python-* 2>/dev/null || true)) works here but has the usual ls + word-splitting pitfalls (SC2207), and will misbehave if any executable name contains whitespace.

Given this is /usr/local/bin and the risk is low, it’s not a blocker, but you could make it more robust with something like:

mapfile -t PYTHON_EXECUTABLES < <(find /usr/local/bin -maxdepth 1 -type f -name 'python-*' | sort)

This avoids ls and handles arbitrary filenames cleanly.

tools/generate_fingerprint.py (1)

46-55: Consider SHA-256 instead of MD5 for file hashes

For drift detection MD5 is functionally fine, but many security tools flag it as “insecure” by default. Switching to hashlib.sha256() would avoid those warnings with essentially no downside in this context.

Example:

hasher = hashlib.sha256()
with open(file_path, "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        hasher.update(chunk)
return hasher.hexdigest()

Not urgent, but worth considering to keep static analysis and security reviews quiet.

tools/list_editable_packages.py (1)

75-129: Optional: surface per-executable failures for easier debugging

get_editable_packages() returning None on any subprocess/JSON error causes main() to silently skip that interpreter. That keeps output clean, but can make it hard to understand why a particular python-* is missing.

If you want more transparency, consider printing a brief message to stderr when an executable is skipped due to an error (perhaps behind a --verbose flag), while still returning 0 to avoid breaking workflows.

nemo_rl/utils/prefetch_venvs.py (3)

57-75: Container-only wrapper creation hook looks correct but may be a bit noisy on bare metal

The unconditional call to create_frozen_environment_symlinks(venv_configs) and the early NRL_CONTAINER guard give a clear separation between container and non-container behavior, which is good.

Minor nit: on non-container runs, this will always print the "Skipping frozen environment wrapper script creation (not in container)" message after prefetching. If prefetch_venvs() is used in local/dev workflows, consider either:

  • Guarding the call at the caller (only call when NRL_CONTAINER is set), or
  • Making the "skipping" message debug/optional

to keep CLI output a bit quieter for non-container use cases.


81-105: Avoid re-running venv creation just to discover python paths

Inside create_frozen_environment_symlinks, you call create_local_venv(py_executable, actor_fqn) again for each actor, even though prefetch_venvs() has already created those venvs just above. Depending on create_local_venv’s implementation, this can:

  • Re-run potentially expensive sync/bootstrap steps, and
  • Duplicate log output, making prefetch runs noisier.

A more efficient pattern would be to reuse the python_path values from the initial prefetch loop instead of invoking create_local_venv again, e.g.:

  • Track a mapping {actor_fqn: python_path} while prefetching, and
  • Pass that mapping into create_frozen_environment_symlinks so it only needs to do collision detection and wrapper generation.

This keeps the behavior the same while avoiding redundant work.


111-149: Wrapper script implementation is solid; consider aligning naming and tightening error handling

The wrapper scripts themselves look good:

  • VIRTUAL_ENV and PATH are set in a predictable way.
  • The exec "$VENV_PATH/bin/python" "$@" handoff keeps argument handling simple and correct.
  • Existing wrappers are replaced atomically via unlink + rewrite + chmod(0o755).

Two small improvements you might consider:

  1. The function is named create_frozen_environment_symlinks, but it actually writes wrapper scripts rather than symlinks. Renaming to something like create_frozen_environment_wrappers (and updating callers) would better match behavior.

  2. The broad except Exception as e blocks around venv discovery and wrapper creation are likely intentional “best-effort” behavior. If you want slightly stronger hygiene without losing that behavior, you could narrow them to anticipated exceptions (e.g., OSError, subprocess.CalledProcessError) while still printing warnings and continuing.

Neither of these is blocking; just potential polish for clarity and maintainability.

tests/unit/test_version_check.py (2)

87-229: Skipped mismatch/edge-case tests: either promote to integration tests or simplify

The additional tests around mismatched fingerprints, ignore flags, missing fingerprint files, and runpy failures are all marked @pytest.mark.skip(...). As a result they currently add complexity without contributing to coverage.

Two options to consider:

  • If you plan to rely on these behaviors, move them toward real integration tests (e.g., minimal reliance on heavy Path patching, or running against an actual small fingerprint script and file).
  • If they’re no longer needed, converting them to parameterized variants of simpler unit tests or removing them would keep the suite leaner.

Not urgent, but clarifying the long-term intent for these skipped tests will help keep the test suite maintainable.


231-460: Integration-style build isolation tests are valuable; consider marking them explicitly as slow

The TestBuildIsolationDetection tests provide useful end-to-end coverage of _is_build_isolation by:

  • Creating a temporary project and wiring it to the local nemo_rl tree.
  • Spawning external build processes and asserting on the recorded sys.prefix and isolation flags.

You already guard against slow/broken environments by skipping on TimeoutExpired and generic initialization failures, which is good.

Given they run real external processes and can take noticeable time, you might want to:

  • Add a marker (e.g., @pytest.mark.slow or similar) so they can be selectively included/excluded in CI, and/or
  • Document in a module-level comment that these are integration tests, not pure unit tests.

Behavior-wise, though, the tests look sound and match the intended build-isolation semantics.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c32778d and 29f027a.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • .github/workflows/cicd-main.yml (1 hunks)
  • docker/Dockerfile (3 hunks)
  • docs/conf.py (1 hunks)
  • docs/design-docs/dependency-management.md (1 hunks)
  • docs/index.md (1 hunks)
  • nemo_rl/__init__.py (2 hunks)
  • nemo_rl/utils/prefetch_venvs.py (2 hunks)
  • pyproject.toml (1 hunks)
  • tests/functional/L1_Functional_Tests_GPU.sh (1 hunks)
  • tests/functional/grpo_frozen_env.sh (1 hunks)
  • tests/functional/test_frozen_env.sh (1 hunks)
  • tests/unit/test_version_check.py (1 hunks)
  • tools/build-custom-vllm.sh (1 hunks)
  • tools/generate_fingerprint.py (1 hunks)
  • tools/list_editable_packages.py (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

  • tests/functional/grpo_frozen_env.sh
  • docker/Dockerfile
  • tests/functional/test_frozen_env.sh
  • tests/functional/L1_Functional_Tests_GPU.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

  • tests/functional/grpo_frozen_env.sh
  • docker/Dockerfile
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

  • tests/functional/L1_Functional_Tests_GPU.sh
🧬 Code graph analysis (3)
tests/functional/test_frozen_env.sh (2)
tools/generate_fingerprint.py (1)
  • main (132-135)
tools/list_editable_packages.py (1)
  • main (132-163)
nemo_rl/utils/prefetch_venvs.py (1)
nemo_rl/utils/venvs.py (1)
  • create_local_venv (33-103)
tests/unit/test_version_check.py (1)
nemo_rl/__init__.py (1)
  • _check_container_fingerprint (72-204)
🪛 markdownlint-cli2 (0.18.1)
docs/design-docs/dependency-management.md

141-141: Blank line inside blockquote

(MD028, no-blanks-blockquote)


192-192: Blank line inside blockquote

(MD028, no-blanks-blockquote)


321-321: Blank line inside blockquote

(MD028, no-blanks-blockquote)

🪛 Ruff (0.14.5)
tools/generate_fingerprint.py

51-51: Probable use of insecure hash functions in hashlib: md5

(S324)


68-68: subprocess call: check for execution of untrusted input

(S603)


70-76: Starting a process with a partial executable path

(S607)

nemo_rl/utils/prefetch_venvs.py

95-100: Abstract raise to an inner function

(TRY301)


95-100: Avoid specifying long messages outside the exception class

(TRY003)


103-103: Do not catch blind exception: Exception

(BLE001)


139-139: Do not catch blind exception: Exception

(BLE001)

tools/list_editable_packages.py

57-57: subprocess call: check for execution of untrusted input

(S603)


85-85: subprocess call: check for execution of untrusted input

(S603)


100-100: subprocess call: check for execution of untrusted input

(S603)


122-122: Consider moving this statement to an else block

(TRY300)

tests/unit/test_version_check.py

57-57: Unused function argument: path

(ARG001)


57-57: Unused function argument: run_name

(ARG001)


124-124: Unused method argument: caplog

(ARG002)


135-135: Unused function argument: path

(ARG001)


135-135: Unused function argument: run_name

(ARG001)


174-174: Unused function argument: path

(ARG001)


174-174: Unused function argument: run_name

(ARG001)


208-208: Unused function argument: path

(ARG001)


208-208: Unused function argument: run_name

(ARG001)


209-209: Avoid specifying long messages outside the exception class

(TRY003)


316-316: Starting a process with a partial executable path

(S607)


325-325: Do not catch blind exception: Exception

(BLE001)


354-354: Starting a process with a partial executable path

(S607)


415-415: subprocess call: check for execution of untrusted input

(S603)


416-427: Starting a process with a partial executable path

(S607)

nemo_rl/__init__.py

162-182: Possible SQL injection vector through string-based query construction

(S608)


192-195: Abstract raise to an inner function

(TRY301)


202-202: Do not catch blind exception: Exception

(BLE001)

🪛 Shellcheck (0.11.0)
tests/functional/grpo_frozen_env.sh

[error] 38-38: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/functional/test_frozen_env.sh

[warning] 19-19: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: sphinx-build / Build docs
  • GitHub Check: build-container / main
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (13)
docs/conf.py (1)

69-69: Mermaid fence support looks correct

Enabling myst_fence_as_directive = ["mermaid"] is consistent with the existing sphinxcontrib.mermaid setup and should make ```mermaid fences render cleanly.

.github/workflows/cicd-main.yml (1)

198-213: Build target and build-args wiring is consistent

Switching the container build target to release and passing MAX_JOBS and NEMO_RL_COMMIT as build-args matches the Dockerfile’s ARG usage and should correctly surface commit metadata in the image.

docs/index.md (1)

237-254: New dependency-management design doc is correctly linked

Adding design-docs/dependency-management.md to the Design Docs toctree looks correct and keeps navigation consistent.

tools/build-custom-vllm.sh (1)

69-69: Torch backend bump aligns with CUDA 12.9 stack

Using --torch-backend=cu129 for torch==2.8.0 aligns this script with the cu129 index and CUDA 12.9 base image, reducing potential ABI mismatches.

tests/functional/L1_Functional_Tests_GPU.sh (1)

22-25: Frozen-environment tests are well integrated into the GPU L1 suite

Running grpo_frozen_env.sh and test_frozen_env.sh via bare bash before the uv run --no-sync tests matches the design intent of validating frozen env behavior, and the coverage output path still aligns with the final coverage combine.

tests/functional/grpo_frozen_env.sh (1)

1-45: Frozen GRPO functional test is structured correctly

This script cleanly exercises the frozen environment via bare python + coverage, writes coverage to tests/.coverage for later aggregation, and enforces a concrete training-metric threshold, all while following the existing tests’ style conventions.

docker/Dockerfile (1)

12-15: Container marker, venv seeding, and fingerprint generation are wired coherently

Setting NRL_CONTAINER=1, seeding the uv venv before dependency resolution, copying just nemo_rl/__init__.py and package_info.py for the custom vLLM path, and emitting /opt/nemo_rl_container_fingerprint via tools/generate_fingerprint.py all line up with the new runtime fingerprint checks and frozen-env design. The placement of the fingerprint file outside /opt/nemo-rl is also a good choice to survive user mounts.

Also applies to: 79-83, 86-88, 132-135

nemo_rl/__init__.py (1)

207-208: Import-time fingerprint enforcement is appropriate but be aware of import cost

Running _check_container_fingerprint() unconditionally at import (gated by NRL_CONTAINER) is a good safeguard but does add I/O and a git/subprocess-dependent code path to every import in containers.

No change needed, but keep this in mind if import-time latency becomes an issue; you could later gate this behind a NRL_DISABLE_VERSION_CHECK or lazily run the check only when starting Ray workers.

tests/functional/test_frozen_env.sh (2)

64-139: Nice end-to-end mutation detection; minor note on network/submodule assumptions

The Test 2 workflow (rsync snapshot, baseline import, pyproject mutation, optional submodule bump) is well thought out and exercises the fingerprint logic realistically. The submodule step gracefully degrades if 3rdparty/megatron-lm/.git is missing or git fetch fails.

Just a small note: because this relies on network access and a specific submodule path, it’s good you already print warnings and skip rather than failing the whole test; keep that behavior if you later add more submodule checks.

Based on learnings


151-195: Import isolation assertions look good and match the design

The checks that python-MegatronPolicyWorker only sees megatron.core (not nemo_automodel) and python-DTensorPolicyWorkerV2 only sees nemo_automodel (not megatron.core) are clear and directly encode the intended separation between worker environments.

This is a solid functional guardrail for the frozen env setup.

Based on learnings

tools/generate_fingerprint.py (1)

58-103: Submodule SHA collection is robust and fail-open in a sensible way

The get_submodule_shas() helper correctly scopes git to the repo via --git-dir/--work-tree, parses git submodule status, and falls back to an empty dict on failure (e.g., missing git, not a repo). That meshes well with the import-time fingerprint check, which will still compare pyproject/uv.lock even if submodule info is unavailable.

Looks good as-is.

tools/list_editable_packages.py (1)

29-72: PATH scanning and executable validation look solid

The discovery logic for python executables (regex filter + --version probe with timeout) is a pragmatic way to find relevant interpreters while excluding argcomplete helpers. Sorting by name and deduping via seen gives deterministic output.

This is a good balance between robustness and simplicity.

tests/unit/test_version_check.py (1)

27-86: Good coverage of basic fingerprint check behavior

The first two tests in TestContainerFingerprintCheck nicely cover:

  • The bare-metal guard (NRL_CONTAINER unset → check is a no-op).
  • The happy-path container case where code and container fingerprints match, including mocking out both the fingerprint script and /opt/nemo_rl_container_fingerprint.

The mocking strategy (patching runpy.run_path and nemo_rl.Path) is a bit white-box but appropriate here given the logic you’re exercising, and the tests assert the main contract (no exception raised) without overcoupling to logging details.

Signed-off-by: Terry Kong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests CI Relating to CI documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support "frozen" environments

2 participants