feat: allow uv-less execution and fingerprint the environment #1491

terrykong · 2025-11-09T07:46:38Z

Dependency Management & Container Fingerprinting

Overview

This PR allows users to use bare python for the driver script and exposes all the other venvs and alias them in the PATH with the names python-$ray_ractor_name. This PR also adds container fingerprinting to detect and handle version mismatches between container environments and local code.

Key Changes

1. Container Fingerprinting System

New files:

tools/generate_fingerprint.py: Computes MD5 hashes of pyproject.toml, uv.lock, and git submodule SHAs to create a unique fingerprint
nemo_rl/__init__.py: Enhanced with _check_container_fingerprint() that validates container/code alignment on startup

How it works:

Containers now include a fingerprint file at /opt/nemo_rl_container_fingerprint generated at build time
On startup (when NRL_CONTAINER=1), NeMo RL compares the container's fingerprint against the current code
If mismatches are detected, users receive a detailed warning showing exactly what changed (dependencies, submodules, etc.)
Container-only feature: Fingerprinting is only available in containers where we can ensure the fingerprint file is created at build time. These checks are automatically skipped when running outside a container

Environment variables:

NRL_IGNORE_VERSION_MISMATCH=1: Bypass the version check (not recommended)
NRL_FORCE_REBUILD_VENVS=true: Automatically skip the check and rebuild venvs with current dependencies (existing env var)
NRL_CONTAINER=1: Indicates running inside a container (set automatically in Dockerfile)

2. Frozen Environment Support

New features:

tools/list_editable_packages.py: Utility to discover which packages can be mounted for development
nemo_rl/utils/prefetch_venvs.py: Now creates convenience symlinks like python-MegatronPolicyWorker, python-VllmGenerationWorker, etc.

Benefits:

Users can directly access environment-specific Python executables without uv run
Supports manual package installation via python-MegatronPolicyWorker -m pip install <pkg> for quick experimentation
All convenience executables available in /usr/local/bin/python-*
Running bare python (instead of uv run) skips network checks that uv performs, providing faster startup in air-gapped or network-constrained environments

3. Comprehensive Documentation

New docs:

docs/design-docs/dependency-management.md:
- Production vs development workflows
- How uv run works under the hood
- Multi-environment architecture (policy, generation, environment workers)
- Container pre-caching mechanisms
- Decision flowcharts and mermaid diagrams
- When to rebuild environments vs containers

Docs enhancements:

Added sphinxcontrib-mermaid support for diagrams in documentation
Updated docs/conf.py to enable mermaid code blocks

4. Docker Improvements

Dockerfile changes:

Added ENV NRL_CONTAINER=1 to indicate container environment
bug fix for custom vllm installs:
- Copy nemo_rl/__init__.py and package_info.py earlier in build for build-custom-vllm.sh
- Add uv venv --seed before conditional vLLM build
Generate and store container fingerprint at /opt/nemo_rl_container_fingerprint

5. Test Coverage

New tests (tests/unit/test_version_check.py):

Build isolation detection tests (validates fingerprint checks skip during uv lock/sync)
Fingerprint check tests for match/mismatch scenarios
Environment variable override tests
Integration tests with real uv sync commands

6. Miscellaneous

Added pip to core dependencies for frozen environment support (since uv venv --seed doesn't always add it)
Fixed torch backend in build-custom-vllm.sh (cu128 → cu129)
Improved logging format to show file location and line numbers for all log messages (default level: WARNING and above)

Workflows Supported

Production Workflow

Use pre-built containers with pre-cached environments
No rebuilding needed if code matches container
Fast startup times

Development Workflow

Set NRL_FORCE_REBUILD_VENVS=true to rebuild venvs on demand
Works on single-node and multi-node setups
Not intended for large scale runs

Frozen Environment Workflow

Direct access to Python executables without uv run
Manual dependency management with pip
Useful for experimentation (but changes must be upstreamed via pyproject.toml)
Useful if you need bare python which in very high node scales can be convenient.

Migration Notes

Existing workflows continue to work without changes
Version mismatch warnings are informative and provide clear resolution steps
Users can bypass checks if needed (though not recommended)

terrykong · 2025-11-10T07:31:40Z

@CodeRabbit summary

coderabbitai · 2025-11-10T07:31:58Z

✅ Actions performed

Summary regeneration triggered.

coderabbitai · 2025-11-10T07:33:32Z

📝 Walkthrough

Walkthrough

Adds support for container "frozen" environments: container fingerprint generation and import-time fingerprint validation, venv wrapper scripts for direct python-{Worker} invocation, documentation and tests for dependency management and frozen-env workflows, Docker build adjustments, and related tooling utilities.

Changes

Cohort / File(s)	Summary
Container build & fingerprinting `docker/Dockerfile`, `.github/workflows/cicd-main.yml`	Sets `ENV NRL_CONTAINER=1`, seeds uv venv in image, conditionally runs `tools/build-custom-vllm.sh` via `BUILD_CUSTOM_VLLM`, writes container fingerprint via `python tools/generate_fingerprint.py` to `/opt/nemo_rl_container_fingerprint`, and passes build-args `MAX_JOBS` and `NEMO_RL_COMMIT` in CICD.
Fingerprint generator `tools/generate_fingerprint.py`	New stdlib-only script computing MD5s for `pyproject.toml`, `uv.lock` and Git submodule SHAs; emits sorted JSON fingerprint to stdout; new public functions: `get_repo_root`, `compute_file_hash`, `get_submodule_shas`, `generate_fingerprint`, `main`.
Runtime version checking `nemo_rl/__init__.py`	Adds import-time logic to detect build isolation, run container-only fingerprint comparison against `/opt/nemo_rl_container_fingerprint`, warn or raise on mismatches (controlled by `NRL_IGNORE_VERSION_MISMATCH`), and exposes helper functions `_is_build_isolation()` and `_check_container_fingerprint()`.
Frozen env wrapper creation `nemo_rl/utils/prefetch_venvs.py`	Adds `create_frozen_environment_symlinks(venv_configs)` to create `/usr/local/bin/python-{ClassName}` wrapper scripts that set `VIRTUAL_ENV`/`PATH` and exec the venv Python when `NRL_CONTAINER` is set; invoked after venv prefetch.
Dependency metadata & build scripts `pyproject.toml`, `tools/build-custom-vllm.sh`	Adds top-level dependency `pip` (seed reliability) and updates Torch wheel CUDA backend from `cu128` to `cu129` in custom vLLM build script.
Design docs & Sphinx `docs/design-docs/dependency-management.md`, `docs/conf.py`, `docs/index.md`	Adds a dependency-management design doc and registers it in the Design Docs toctree; enables MyST directive handling for `mermaid` fenced blocks in `docs/conf.py`.
Tests — functional `tests/functional/test_frozen_env.sh`, `tests/functional/grpo_frozen_env.sh`, `tests/functional/L1_Functional_Tests_GPU.sh`	Adds `test_frozen_env.sh` (three-part validation of wrapper executables, mutation detection, import isolation), `grpo_frozen_env.sh` (GRPO run under frozen env), and expands L1 GPU suite to run frozen-env tests and additional workflows.
Tests — unit `tests/unit/test_version_check.py`	New unit/integration-style tests exercising fingerprint check paths, mismatch handling, build isolation detection, and force-rebuild behavior; extensive mocking and some skipped integration tests.
Utilities `tools/list_editable_packages.py`	New utility to enumerate editable pip packages per discovered `python*` executables on PATH (helpers: `find_python_executables`, `get_editable_packages`, `main`).

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant nemo as nemo_rl.__init__
    participant gen as tools/generate_fingerprint.py
    participant FS as Container FS

    rect rgb(220,235,255)
      Note over User,nemo: Import-time fingerprint validation (container-only)
      User->>nemo: import nemo_rl
      alt NRL_CONTAINER set and not build-isolation and not NRL_FORCE_REBUILD_VENVS
        nemo->>gen: run generate_fingerprint.py
        gen->>gen: compute MD5(pyproject.toml, uv.lock) & submodule SHAs
        gen-->>nemo: fingerprint dict
        nemo->>FS: read /opt/nemo_rl_container_fingerprint
        FS-->>nemo: container fingerprint
        nemo->>nemo: compare fingerprints
        alt mismatch & not NRL_IGNORE_VERSION_MISMATCH
          nemo->>User: raise RuntimeError (stop import)
        else mismatch & NRL_IGNORE_VERSION_MISMATCH
          nemo->>User: log warning, continue
        else match
          nemo->>User: import completes
        end
      else skip checks
        nemo->>User: import completes
      end
    end

sequenceDiagram
    participant Caller
    participant Prefetch as prefetch_venvs.create
    participant FS as /usr/local/bin

    rect rgb(235,250,230)
      Note over Caller,FS: Frozen environment wrapper creation
      Caller->>Prefetch: prefetch_venvs() completes venvs
      alt NRL_CONTAINER set
        Prefetch->>Prefetch: create_frozen_environment_symlinks(venv_configs)
        loop per venv config
          Prefetch->>FS: write python-{ClassName} wrapper script
          FS->>FS: set VIRTUAL_ENV, update PATH, exec venv python
        end
        Prefetch->>Caller: log summary of created wrappers
      else skip creation
        Prefetch->>Caller: skip
      end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30–45 minutes

Pay extra attention to:
- nemo_rl/init.py — import-time side effects, environment variable gating, and exception semantics.
- tools/generate_fingerprint.py — correctness of file hashing, submodule SHA extraction, and stable JSON ordering.
- nemo_rl/utils/prefetch_venvs.py — wrapper script content, permissions, collision detection, and PATH/VIRTUAL_ENV correctness.
- Tests in tests/unit/test_version_check.py — mocked scenarios and skipped integration tests that may hide brittle assumptions.
- Dockerfile build ordering for seed, fingerprint generation, and conditional vLLM build.

Possibly related PRs

fix: update the custom vllm instructions #1116 — Overlaps on BUILD_CUSTOM_VLLM usage and Docker/build-script interactions.
chore: major version bump (torch 2.8, vllm 0.11, ray 2.49) & SP fixes #1334 — Related Dockerfile and vLLM build script changes impacting container build and runtime flags.

Suggested labels

CI:L1, Run CICD

Suggested reviewers

hemildesai
chtruong814
parthchadha

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: allow uv-less execution and fingerprint the environment' directly summarizes the two main changes: enabling uv-less execution and adding environment fingerprinting for version/drift detection.
Linked Issues check	✅ Passed	All primary objectives from #1480 are met: uv-less execution via per-venv python wrappers [1480], frozen environments with seeded venvs [1480], container drift detection via MD5/submodule fingerprints [1480], and environment variable overrides for flexibility [1480].
Out of Scope Changes check	✅ Passed	All changes align with #1480 objectives; minor supporting changes include torch backend update, pip dependency addition, documentation, and test infrastructure—all necessary for frozen environment support.
Test Results For Major Changes	✅ Passed	PR includes 460 lines of unit tests for fingerprint verification and 258 lines of functional tests validating frozen environment behavior with metric validation in CI pipeline.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch tk/uv-less

Tip

📝 Customizable high-level summaries are now available in beta!

You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.

Provide your own instructions using the high_level_summary_instructions setting.
Format the summary however you like (bullet lists, tables, multi-section layouts, contributor stats, etc.).
Use high_level_summary_in_walkthrough to move the summary from the description to the walkthrough section.

Example instruction:

"Divide the high-level summary into five sections:

📝 Description — Summarize the main change in 50–60 words, explaining what was done.

📓 References — List relevant issues, discussions, documentation, or related PRs.

📦 Dependencies & Requirements — Mention any new/updated dependencies, environment variable changes, or configuration updates.

📊 Contributor Summary — Include a Markdown table showing contributions:
| Contributor | Lines Added | Lines Removed | Files Changed |

✔️ Additional Notes — Add any extra reviewer context.
Keep each section concise (under 200 words) and use bullet or numbered lists for clarity."

Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> seed Signed-off-by: Terry Kong <[email protected]> init Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> fix symlink creation to use exec as opposed to symlink which doesn't work Signed-off-by: Terry Kong <[email protected]> frozen grpo Signed-off-by: Terry Kong <[email protected]> use a json fingerprint Signed-off-by: Terry Kong <[email protected]> logging level Signed-off-by: Terry Kong <[email protected]> docs update Signed-off-by: Terry Kong <[email protected]> fix up Signed-off-by: Terry Kong <[email protected]> more functional test Signed-off-by: Terry Kong <[email protected]> mermaid Signed-off-by: Terry Kong <[email protected]> nrl-force rebuild and build-v will skip fingerprint check Signed-off-by: Terry Kong <[email protected]> fix tests Signed-off-by: Terry Kong <[email protected]> explain the local development Signed-off-by: Terry Kong <[email protected]> fix ci container Signed-off-by: Terry Kong <[email protected]> safe.directory fix Signed-off-by: Terry Kong <[email protected]>

Signed-off-by: Terry Kong <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (15)

pyproject.toml (1)

18-21: Adding pip as a core dependency is reasonable for frozen envs

Including pip in the base dependencies matches the frozen-environment workflow where uv venv --seed may not always install it; behavior-wise this looks fine. If you start seeing churn from future pip releases, you might later choose to version-pin for reproducibility, but it’s not required now.

nemo_rl/__init__.py (3)

19-23: Avoid configuring global logging at import time

Calling logging.basicConfig(...) in a library’s __init__ affects the root logger formatting/level for any embedding application that hasn’t configured logging yet. That’s surprising behavior for consumers.

Prefer configuring only this package’s logger (e.g., logger = logging.getLogger(__name__) with a handler) or restricting this setup to CLI entry points instead of import-time.

58-70: Build-isolation detection heuristic is brittle but probably sufficient

The _is_build_isolation() check relies on "/builds-v" being present in sys.prefix. If uv ever changes its cache layout, this could silently stop skipping fingerprint checks in build isolation.

If you want this more robust, consider tightening the pattern (e.g., checking for /.cache/uv/builds-v prefix) and/or adding a small test that asserts the behavior against a known uv build path.

72-205: Fingerprint check is solid overall; a few robustness and ergonomics tweaks

The container/code drift check is well structured and guarded by NRL_CONTAINER, but a few details could be improved:

Environment flag parsing (Lines 86-89): NRL_FORCE_REBUILD_VENVS only skips checks when exactly "true" (case-insensitive lower). Users may reasonably set "1" or "True". Consider normalizing to accept a small set of truthy values ({"1","true","yes","on"}) to avoid surprises.

Using runpy + sys.stdout hijack (Lines 118-125): Temporarily reassigning sys.stdout is a global side effect and could be racy if imports happen in parallel. Since tools/generate_fingerprint.py exposes generate_fingerprint(), it’d be cleaner to import and call that function directly via importlib/runpy.run_path (using the returned globals) instead of driving the script via stdout.

Swallowing unexpected failures quietly (Lines 199-205): Any non–RuntimeError gets reduced to a DEBUG log message, which will be hidden under the default WARNING-level config. If the fingerprint mechanism fails (e.g., JSON decode error, script crash, unreadable /opt/... file), users may never notice that the guardrail is effectively disabled. Consider logging these as logging.warning (or at least info) with a short message that the version check was skipped due to an internal error.

The mismatch handling / message content look good and align with the design doc; just these small changes would make the behavior more predictable in production.
docs/design-docs/dependency-management.md (3)
205-221: Decision-guide flowchart seems inverted vs described workflows

In the “Decision Guide” mermaid diagram, node C (“Container built from same commit as code?”) sends Yes to:
F["✓ Run with
NRL_FORCE_REBUILD_VENVS=true uv run examples/..."]
Loading
and No to D (small scale or testing). Based on the earlier text, it seems the expected behavior is:

If the container and code are from the same commit, you should just run normally (uv run ...), no NRL_FORCE_REBUILD_VENVS needed.

If they differ, you go into the development workflow branch (deciding between NRL_FORCE_REBUILD_VENVS=true and rebuilding the container).

Consider swapping the E/F nodes or adjusting edges so the flowchart matches the narrative and the actual recommended workflows.

271-311: Align documented “Solutions” list with the actual runtime warning

The “Container Version Checking” example lists three solutions (rebuild container, set NRL_FORCE_REBUILD_VENVS=true, set NRL_IGNORE_VERSION_MISMATCH=1), whereas the runtime warning in nemo_rl/__init__.py offers an additional option to regenerate /opt/nemo_rl_container_fingerprint with tools/generate_fingerprint.py.

For clarity, consider either:

Adding the “regenerate fingerprint” option to the doc snippet, or

Removing it from the runtime warning if you don’t want to encourage that path for most users.

Keeping these consistent will make it easier for users to reconcile the docs with what they see at runtime.

234-325: Minor markdownlint nits around blockquotes

Static analysis flagged blank lines inside blockquotes (MD028) in a few places (e.g., the warnings/caution callouts around frozen environments). Not urgent, but trimming the blank lines inside those > sections will keep markdownlint happy and avoid noisy CI failures if you enforce docs linting.
tests/functional/test_frozen_env.sh (1)
18-41: Slightly brittle ls-based discovery of python- executables*

PYTHON_EXECUTABLES=($(ls -1 /usr/local/bin/python-* 2>/dev/null || true)) works here but has the usual ls + word-splitting pitfalls (SC2207), and will misbehave if any executable name contains whitespace.

Given this is /usr/local/bin and the risk is low, it’s not a blocker, but you could make it more robust with something like:
mapfile -t PYTHON_EXECUTABLES < <(find /usr/local/bin -maxdepth 1 -type f -name 'python-*' | sort)
This avoids ls and handles arbitrary filenames cleanly.
tools/generate_fingerprint.py (1)
46-55: Consider SHA-256 instead of MD5 for file hashes

For drift detection MD5 is functionally fine, but many security tools flag it as “insecure” by default. Switching to hashlib.sha256() would avoid those warnings with essentially no downside in this context.

Example:
hasher = hashlib.sha256()
with open(file_path, "rb") as f:
    for chunk in iter(lambda: f.read(8192), b""):
        hasher.update(chunk)
return hasher.hexdigest()
Not urgent, but worth considering to keep static analysis and security reviews quiet.
tools/list_editable_packages.py (1)

75-129: Optional: surface per-executable failures for easier debugging

get_editable_packages() returning None on any subprocess/JSON error causes main() to silently skip that interpreter. That keeps output clean, but can make it hard to understand why a particular python-* is missing.

If you want more transparency, consider printing a brief message to stderr when an executable is skipped due to an error (perhaps behind a --verbose flag), while still returning 0 to avoid breaking workflows.

nemo_rl/utils/prefetch_venvs.py (3)

57-75: Container-only wrapper creation hook looks correct but may be a bit noisy on bare metal

The unconditional call to create_frozen_environment_symlinks(venv_configs) and the early NRL_CONTAINER guard give a clear separation between container and non-container behavior, which is good.

Minor nit: on non-container runs, this will always print the "Skipping frozen environment wrapper script creation (not in container)" message after prefetching. If prefetch_venvs() is used in local/dev workflows, consider either:

Guarding the call at the caller (only call when NRL_CONTAINER is set), or

Making the "skipping" message debug/optional

to keep CLI output a bit quieter for non-container use cases.

81-105: Avoid re-running venv creation just to discover python paths

Inside create_frozen_environment_symlinks, you call create_local_venv(py_executable, actor_fqn) again for each actor, even though prefetch_venvs() has already created those venvs just above. Depending on create_local_venv’s implementation, this can:

Re-run potentially expensive sync/bootstrap steps, and

Duplicate log output, making prefetch runs noisier.

A more efficient pattern would be to reuse the python_path values from the initial prefetch loop instead of invoking create_local_venv again, e.g.:

Track a mapping {actor_fqn: python_path} while prefetching, and

Pass that mapping into create_frozen_environment_symlinks so it only needs to do collision detection and wrapper generation.

This keeps the behavior the same while avoiding redundant work.

111-149: Wrapper script implementation is solid; consider aligning naming and tightening error handling

The wrapper scripts themselves look good:

VIRTUAL_ENV and PATH are set in a predictable way.

The exec "$VENV_PATH/bin/python" "$@" handoff keeps argument handling simple and correct.

Existing wrappers are replaced atomically via unlink + rewrite + chmod(0o755).

Two small improvements you might consider:

The function is named create_frozen_environment_symlinks, but it actually writes wrapper scripts rather than symlinks. Renaming to something like create_frozen_environment_wrappers (and updating callers) would better match behavior.

The broad except Exception as e blocks around venv discovery and wrapper creation are likely intentional “best-effort” behavior. If you want slightly stronger hygiene without losing that behavior, you could narrow them to anticipated exceptions (e.g., OSError, subprocess.CalledProcessError) while still printing warnings and continuing.

Neither of these is blocking; just potential polish for clarity and maintainability.

tests/unit/test_version_check.py (2)

87-229: Skipped mismatch/edge-case tests: either promote to integration tests or simplify

The additional tests around mismatched fingerprints, ignore flags, missing fingerprint files, and runpy failures are all marked @pytest.mark.skip(...). As a result they currently add complexity without contributing to coverage.

Two options to consider:

If you plan to rely on these behaviors, move them toward real integration tests (e.g., minimal reliance on heavy Path patching, or running against an actual small fingerprint script and file).

If they’re no longer needed, converting them to parameterized variants of simpler unit tests or removing them would keep the suite leaner.

Not urgent, but clarifying the long-term intent for these skipped tests will help keep the test suite maintainable.

231-460: Integration-style build isolation tests are valuable; consider marking them explicitly as slow

The TestBuildIsolationDetection tests provide useful end-to-end coverage of _is_build_isolation by:

Creating a temporary project and wiring it to the local nemo_rl tree.

Spawning external build processes and asserting on the recorded sys.prefix and isolation flags.

You already guard against slow/broken environments by skipping on TimeoutExpired and generic initialization failures, which is good.

Given they run real external processes and can take noticeable time, you might want to:

Add a marker (e.g., @pytest.mark.slow or similar) so they can be selectively included/excluded in CI, and/or

Document in a module-level comment that these are integration tests, not pure unit tests.

Behavior-wise, though, the tests look sound and match the intended build-isolation semantics.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c32778d and 29f027a.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (15)

.github/workflows/cicd-main.yml (1 hunks)
docker/Dockerfile (3 hunks)
docs/conf.py (1 hunks)
docs/design-docs/dependency-management.md (1 hunks)
docs/index.md (1 hunks)
nemo_rl/__init__.py (2 hunks)
nemo_rl/utils/prefetch_venvs.py (2 hunks)
pyproject.toml (1 hunks)
tests/functional/L1_Functional_Tests_GPU.sh (1 hunks)
tests/functional/grpo_frozen_env.sh (1 hunks)
tests/functional/test_frozen_env.sh (1 hunks)
tests/unit/test_version_check.py (1 hunks)
tools/build-custom-vllm.sh (1 hunks)
tools/generate_fingerprint.py (1 hunks)
tools/list_editable_packages.py (1 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-10-12T14:46:55.513Z

Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.

Applied to files:

tests/functional/grpo_frozen_env.sh
docker/Dockerfile
tests/functional/test_frozen_env.sh
tests/functional/L1_Functional_Tests_GPU.sh

📚 Learning: 2025-09-19T07:28:29.887Z

Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.

Applied to files:

tests/functional/grpo_frozen_env.sh
docker/Dockerfile

📚 Learning: 2025-10-12T14:46:57.171Z

Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.

Applied to files:

tests/functional/L1_Functional_Tests_GPU.sh

🧬 Code graph analysis (3)

tests/functional/test_frozen_env.sh (2)

tools/generate_fingerprint.py (1)

main (132-135)

tools/list_editable_packages.py (1)

main (132-163)

nemo_rl/utils/prefetch_venvs.py (1)

nemo_rl/utils/venvs.py (1)

create_local_venv (33-103)

tests/unit/test_version_check.py (1)

nemo_rl/__init__.py (1)

_check_container_fingerprint (72-204)

🪛 markdownlint-cli2 (0.18.1)

docs/design-docs/dependency-management.md

141-141: Blank line inside blockquote

(MD028, no-blanks-blockquote)

192-192: Blank line inside blockquote

(MD028, no-blanks-blockquote)

321-321: Blank line inside blockquote

(MD028, no-blanks-blockquote)

🪛 Ruff (0.14.5)

tools/generate_fingerprint.py

51-51: Probable use of insecure hash functions in hashlib: md5

(S324)

68-68: subprocess call: check for execution of untrusted input

(S603)

70-76: Starting a process with a partial executable path

(S607)

nemo_rl/utils/prefetch_venvs.py

95-100: Abstract raise to an inner function

(TRY301)

95-100: Avoid specifying long messages outside the exception class

(TRY003)

103-103: Do not catch blind exception: Exception

(BLE001)

139-139: Do not catch blind exception: Exception

(BLE001)

tools/list_editable_packages.py

57-57: subprocess call: check for execution of untrusted input

(S603)

85-85: subprocess call: check for execution of untrusted input

(S603)

100-100: subprocess call: check for execution of untrusted input

(S603)

122-122: Consider moving this statement to an else block

(TRY300)

tests/unit/test_version_check.py

57-57: Unused function argument: path

(ARG001)

57-57: Unused function argument: run_name

(ARG001)

124-124: Unused method argument: caplog

(ARG002)

135-135: Unused function argument: path

(ARG001)

135-135: Unused function argument: run_name

(ARG001)

174-174: Unused function argument: path

(ARG001)

174-174: Unused function argument: run_name

(ARG001)

208-208: Unused function argument: path

(ARG001)

208-208: Unused function argument: run_name

(ARG001)

209-209: Avoid specifying long messages outside the exception class

(TRY003)

316-316: Starting a process with a partial executable path

(S607)

325-325: Do not catch blind exception: Exception

(BLE001)

354-354: Starting a process with a partial executable path

(S607)

415-415: subprocess call: check for execution of untrusted input

(S603)

416-427: Starting a process with a partial executable path

(S607)

nemo_rl/__init__.py

162-182: Possible SQL injection vector through string-based query construction

(S608)

192-195: Abstract raise to an inner function

(TRY301)

202-202: Do not catch blind exception: Exception

(BLE001)

🪛 Shellcheck (0.11.0)

tests/functional/grpo_frozen_env.sh

[error] 38-38: Double quote array expansions to avoid re-splitting elements.

(SC2068)

tests/functional/test_frozen_env.sh

[warning] 19-19: Prefer mapfile or read -a to split command output (or quote to avoid splitting).

(SC2207)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: sphinx-build / Build docs
GitHub Check: build-container / main
GitHub Check: Lint check
GitHub Check: Lint check
GitHub Check: Post automodel integration comment / Comment on PR
GitHub Check: Post submodule check comment / Comment on PR

🔇 Additional comments (13)

docs/conf.py (1)

69-69: Mermaid fence support looks correct

Enabling myst_fence_as_directive = ["mermaid"] is consistent with the existing sphinxcontrib.mermaid setup and should make ```mermaid fences render cleanly.

.github/workflows/cicd-main.yml (1)

198-213: Build target and build-args wiring is consistent

Switching the container build target to release and passing MAX_JOBS and NEMO_RL_COMMIT as build-args matches the Dockerfile’s ARG usage and should correctly surface commit metadata in the image.

docs/index.md (1)

237-254: New dependency-management design doc is correctly linked

Adding design-docs/dependency-management.md to the Design Docs toctree looks correct and keeps navigation consistent.

tools/build-custom-vllm.sh (1)

69-69: Torch backend bump aligns with CUDA 12.9 stack

Using --torch-backend=cu129 for torch==2.8.0 aligns this script with the cu129 index and CUDA 12.9 base image, reducing potential ABI mismatches.

tests/functional/L1_Functional_Tests_GPU.sh (1)

22-25: Frozen-environment tests are well integrated into the GPU L1 suite

Running grpo_frozen_env.sh and test_frozen_env.sh via bare bash before the uv run --no-sync tests matches the design intent of validating frozen env behavior, and the coverage output path still aligns with the final coverage combine.

tests/functional/grpo_frozen_env.sh (1)

1-45: Frozen GRPO functional test is structured correctly

This script cleanly exercises the frozen environment via bare python + coverage, writes coverage to tests/.coverage for later aggregation, and enforces a concrete training-metric threshold, all while following the existing tests’ style conventions.

docker/Dockerfile (1)

12-15: Container marker, venv seeding, and fingerprint generation are wired coherently

Setting NRL_CONTAINER=1, seeding the uv venv before dependency resolution, copying just nemo_rl/__init__.py and package_info.py for the custom vLLM path, and emitting /opt/nemo_rl_container_fingerprint via tools/generate_fingerprint.py all line up with the new runtime fingerprint checks and frozen-env design. The placement of the fingerprint file outside /opt/nemo-rl is also a good choice to survive user mounts.

Also applies to: 79-83, 86-88, 132-135

nemo_rl/__init__.py (1)

207-208: Import-time fingerprint enforcement is appropriate but be aware of import cost

Running _check_container_fingerprint() unconditionally at import (gated by NRL_CONTAINER) is a good safeguard but does add I/O and a git/subprocess-dependent code path to every import in containers.

No change needed, but keep this in mind if import-time latency becomes an issue; you could later gate this behind a NRL_DISABLE_VERSION_CHECK or lazily run the check only when starting Ray workers.

tests/functional/test_frozen_env.sh (2)

64-139: Nice end-to-end mutation detection; minor note on network/submodule assumptions

The Test 2 workflow (rsync snapshot, baseline import, pyproject mutation, optional submodule bump) is well thought out and exercises the fingerprint logic realistically. The submodule step gracefully degrades if 3rdparty/megatron-lm/.git is missing or git fetch fails.

Just a small note: because this relies on network access and a specific submodule path, it’s good you already print warnings and skip rather than failing the whole test; keep that behavior if you later add more submodule checks.

Based on learnings

151-195: Import isolation assertions look good and match the design

The checks that python-MegatronPolicyWorker only sees megatron.core (not nemo_automodel) and python-DTensorPolicyWorkerV2 only sees nemo_automodel (not megatron.core) are clear and directly encode the intended separation between worker environments.

This is a solid functional guardrail for the frozen env setup.

Based on learnings

tools/generate_fingerprint.py (1)

58-103: Submodule SHA collection is robust and fail-open in a sensible way

The get_submodule_shas() helper correctly scopes git to the repo via --git-dir/--work-tree, parses git submodule status, and falls back to an empty dict on failure (e.g., missing git, not a repo). That meshes well with the import-time fingerprint check, which will still compare pyproject/uv.lock even if submodule info is unavailable.

Looks good as-is.

tools/list_editable_packages.py (1)

29-72: PATH scanning and executable validation look solid

The discovery logic for python executables (regex filter + --version probe with timeout) is a pragmatic way to find relevant interpreters while excluding argcomplete helpers. Sorting by name and deduping via seen gives deterministic output.

This is a good balance between robustness and simplicity.

tests/unit/test_version_check.py (1)

27-86: Good coverage of basic fingerprint check behavior

The first two tests in TestContainerFingerprintCheck nicely cover:

The bare-metal guard (NRL_CONTAINER unset → check is a no-op).

The happy-path container case where code and container fingerprints match, including mocking out both the fingerprint script and /opt/nemo_rl_container_fingerprint.

The mocking strategy (patching runpy.run_path and nemo_rl.Path) is a bit white-box but appropriate here given the logic you’re exercising, and the tests assert the main contract (no exception raised) without overcoupling to logging details.

Signed-off-by: Terry Kong <[email protected]>

github-actions bot added the documentation Improvements or additions to documentation label Nov 9, 2025

terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Nov 10, 2025

terrykong temporarily deployed to nemo-ci November 10, 2025 06:08 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci November 10, 2025 06:29 — with GitHub Actions Inactive

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 10, 2025

terrykong temporarily deployed to nemo-ci November 10, 2025 07:35 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci November 10, 2025 07:54 — with GitHub Actions Inactive

terrykong had a problem deploying to nemo-ci November 10, 2025 10:22 — with GitHub Actions Failure

github-actions bot added the CI Relating to CI label Nov 11, 2025

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 11, 2025

terrykong temporarily deployed to nemo-ci November 11, 2025 06:06 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci November 11, 2025 06:10 — with GitHub Actions Inactive

terrykong had a problem deploying to nemo-ci November 11, 2025 08:35 — with GitHub Actions Failure

terrykong force-pushed the tk/uv-less branch from 6758787 to 7d1d7f2 Compare November 18, 2025 06:56

terrykong force-pushed the tk/uv-less branch from 7d1d7f2 to 1792a41 Compare November 18, 2025 07:01

fix uv.lock

907dc91

Signed-off-by: Terry Kong <[email protected]>

terrykong marked this pull request as ready for review November 18, 2025 07:03

terrykong requested review from a team as code owners November 18, 2025 07:03

lint

29f027a

Signed-off-by: Terry Kong <[email protected]>

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Nov 18, 2025

terrykong temporarily deployed to nemo-ci November 18, 2025 07:06 — with GitHub Actions Inactive

terrykong requested a review from hemildesai November 18, 2025 07:09

coderabbitai bot reviewed Nov 18, 2025

View reviewed changes

terrykong temporarily deployed to nemo-ci November 18, 2025 07:35 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci November 18, 2025 10:57 — with GitHub Actions Inactive

fix custom vllm

20e266b

Signed-off-by: Terry Kong <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: allow uv-less execution and fingerprint the environment #1491

feat: allow uv-less execution and fingerprint the environment #1491

Uh oh!

terrykong commented Nov 9, 2025 •

edited

Loading

Uh oh!

terrykong commented Nov 10, 2025

Uh oh!

coderabbitai bot commented Nov 10, 2025

Uh oh!

coderabbitai bot commented Nov 10, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: allow uv-less execution and fingerprint the environment #1491

Are you sure you want to change the base?

feat: allow uv-less execution and fingerprint the environment #1491

Uh oh!

Conversation

terrykong commented Nov 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Dependency Management & Container Fingerprinting

Overview

Key Changes

1. Container Fingerprinting System

2. Frozen Environment Support

3. Comprehensive Documentation

4. Docker Improvements

5. Test Coverage

6. Miscellaneous

Workflows Supported

Production Workflow

Development Workflow

Frozen Environment Workflow

Migration Notes

Uh oh!

terrykong commented Nov 10, 2025

Uh oh!

coderabbitai bot commented Nov 10, 2025

Uh oh!

coderabbitai bot commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

terrykong commented Nov 9, 2025 •

edited

Loading

coderabbitai bot commented Nov 10, 2025 •

edited

Loading