-
Notifications
You must be signed in to change notification settings - Fork 169
feat: allow uv-less execution and fingerprint the environment #1491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@CodeRabbit summary |
✅ Actions performedSummary regeneration triggered. |
📝 WalkthroughWalkthroughAdds support for container "frozen" environments: container fingerprint generation and import-time fingerprint validation, venv wrapper scripts for direct python-{Worker} invocation, documentation and tests for dependency management and frozen-env workflows, Docker build adjustments, and related tooling utilities. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant nemo as nemo_rl.__init__
participant gen as tools/generate_fingerprint.py
participant FS as Container FS
rect rgb(220,235,255)
Note over User,nemo: Import-time fingerprint validation (container-only)
User->>nemo: import nemo_rl
alt NRL_CONTAINER set and not build-isolation and not NRL_FORCE_REBUILD_VENVS
nemo->>gen: run generate_fingerprint.py
gen->>gen: compute MD5(pyproject.toml, uv.lock) & submodule SHAs
gen-->>nemo: fingerprint dict
nemo->>FS: read /opt/nemo_rl_container_fingerprint
FS-->>nemo: container fingerprint
nemo->>nemo: compare fingerprints
alt mismatch & not NRL_IGNORE_VERSION_MISMATCH
nemo->>User: raise RuntimeError (stop import)
else mismatch & NRL_IGNORE_VERSION_MISMATCH
nemo->>User: log warning, continue
else match
nemo->>User: import completes
end
else skip checks
nemo->>User: import completes
end
end
sequenceDiagram
participant Caller
participant Prefetch as prefetch_venvs.create
participant FS as /usr/local/bin
rect rgb(235,250,230)
Note over Caller,FS: Frozen environment wrapper creation
Caller->>Prefetch: prefetch_venvs() completes venvs
alt NRL_CONTAINER set
Prefetch->>Prefetch: create_frozen_environment_symlinks(venv_configs)
loop per venv config
Prefetch->>FS: write python-{ClassName} wrapper script
FS->>FS: set VIRTUAL_ENV, update PATH, exec venv python
end
Prefetch->>Caller: log summary of created wrappers
else skip creation
Prefetch->>Caller: skip
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30–45 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
6758787 to
7d1d7f2
Compare
Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> seed Signed-off-by: Terry Kong <[email protected]> init Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> fix Signed-off-by: Terry Kong <[email protected]> fix symlink creation to use exec as opposed to symlink which doesn't work Signed-off-by: Terry Kong <[email protected]> frozen grpo Signed-off-by: Terry Kong <[email protected]> use a json fingerprint Signed-off-by: Terry Kong <[email protected]> logging level Signed-off-by: Terry Kong <[email protected]> docs update Signed-off-by: Terry Kong <[email protected]> fix up Signed-off-by: Terry Kong <[email protected]> more functional test Signed-off-by: Terry Kong <[email protected]> mermaid Signed-off-by: Terry Kong <[email protected]> nrl-force rebuild and build-v will skip fingerprint check Signed-off-by: Terry Kong <[email protected]> fix tests Signed-off-by: Terry Kong <[email protected]> explain the local development Signed-off-by: Terry Kong <[email protected]> fix ci container Signed-off-by: Terry Kong <[email protected]> safe.directory fix Signed-off-by: Terry Kong <[email protected]>
7d1d7f2 to
1792a41
Compare
Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: Terry Kong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (15)
pyproject.toml (1)
18-21: Adding pip as a core dependency is reasonable for frozen envsIncluding
pipin the base dependencies matches the frozen-environment workflow whereuv venv --seedmay not always install it; behavior-wise this looks fine. If you start seeing churn from future pip releases, you might later choose to version-pin for reproducibility, but it’s not required now.nemo_rl/__init__.py (3)
19-23: Avoid configuring global logging at import timeCalling
logging.basicConfig(...)in a library’s__init__affects the root logger formatting/level for any embedding application that hasn’t configured logging yet. That’s surprising behavior for consumers.Prefer configuring only this package’s logger (e.g.,
logger = logging.getLogger(__name__)with a handler) or restricting this setup to CLI entry points instead of import-time.
58-70: Build-isolation detection heuristic is brittle but probably sufficientThe
_is_build_isolation()check relies on"/builds-v"being present insys.prefix. If uv ever changes its cache layout, this could silently stop skipping fingerprint checks in build isolation.If you want this more robust, consider tightening the pattern (e.g., checking for
/.cache/uv/builds-vprefix) and/or adding a small test that asserts the behavior against a known uv build path.
72-205: Fingerprint check is solid overall; a few robustness and ergonomics tweaksThe container/code drift check is well structured and guarded by
NRL_CONTAINER, but a few details could be improved:
- Environment flag parsing (Lines 86-89):
NRL_FORCE_REBUILD_VENVSonly skips checks when exactly"true"(case-insensitive lower). Users may reasonably set"1"or"True". Consider normalizing to accept a small set of truthy values ({"1","true","yes","on"}) to avoid surprises.- Using
runpy+sys.stdouthijack (Lines 118-125): Temporarily reassigningsys.stdoutis a global side effect and could be racy if imports happen in parallel. Sincetools/generate_fingerprint.pyexposesgenerate_fingerprint(), it’d be cleaner to import and call that function directly viaimportlib/runpy.run_path(using the returned globals) instead of driving the script via stdout.- Swallowing unexpected failures quietly (Lines 199-205): Any non–
RuntimeErrorgets reduced to a DEBUG log message, which will be hidden under the default WARNING-level config. If the fingerprint mechanism fails (e.g., JSON decode error, script crash, unreadable/opt/...file), users may never notice that the guardrail is effectively disabled. Consider logging these aslogging.warning(or at leastinfo) with a short message that the version check was skipped due to an internal error.The mismatch handling / message content look good and align with the design doc; just these small changes would make the behavior more predictable in production.
docs/design-docs/dependency-management.md (3)
205-221: Decision-guide flowchart seems inverted vs described workflowsIn the “Decision Guide” mermaid diagram, node C (“Container built from same commit as code?”) sends Yes to:
LoadingF["✓ Run with NRL_FORCE_REBUILD_VENVS=true uv run examples/..."]and No to D (small scale or testing). Based on the earlier text, it seems the expected behavior is:
- If the container and code are from the same commit, you should just run normally (
uv run ...), noNRL_FORCE_REBUILD_VENVSneeded.- If they differ, you go into the development workflow branch (deciding between
NRL_FORCE_REBUILD_VENVS=trueand rebuilding the container).Consider swapping the E/F nodes or adjusting edges so the flowchart matches the narrative and the actual recommended workflows.
271-311: Align documented “Solutions” list with the actual runtime warningThe “Container Version Checking” example lists three solutions (rebuild container, set
NRL_FORCE_REBUILD_VENVS=true, setNRL_IGNORE_VERSION_MISMATCH=1), whereas the runtime warning innemo_rl/__init__.pyoffers an additional option to regenerate/opt/nemo_rl_container_fingerprintwithtools/generate_fingerprint.py.For clarity, consider either:
- Adding the “regenerate fingerprint” option to the doc snippet, or
- Removing it from the runtime warning if you don’t want to encourage that path for most users.
Keeping these consistent will make it easier for users to reconcile the docs with what they see at runtime.
234-325: Minor markdownlint nits around blockquotesStatic analysis flagged blank lines inside blockquotes (MD028) in a few places (e.g., the warnings/caution callouts around frozen environments). Not urgent, but trimming the blank lines inside those
>sections will keep markdownlint happy and avoid noisy CI failures if you enforce docs linting.tests/functional/test_frozen_env.sh (1)
18-41: Slightly brittlels-based discovery of python- executables*
PYTHON_EXECUTABLES=($(ls -1 /usr/local/bin/python-* 2>/dev/null || true))works here but has the usualls+ word-splitting pitfalls (SC2207), and will misbehave if any executable name contains whitespace.Given this is /usr/local/bin and the risk is low, it’s not a blocker, but you could make it more robust with something like:
mapfile -t PYTHON_EXECUTABLES < <(find /usr/local/bin -maxdepth 1 -type f -name 'python-*' | sort)This avoids
lsand handles arbitrary filenames cleanly.tools/generate_fingerprint.py (1)
46-55: Consider SHA-256 instead of MD5 for file hashesFor drift detection MD5 is functionally fine, but many security tools flag it as “insecure” by default. Switching to
hashlib.sha256()would avoid those warnings with essentially no downside in this context.Example:
hasher = hashlib.sha256() with open(file_path, "rb") as f: for chunk in iter(lambda: f.read(8192), b""): hasher.update(chunk) return hasher.hexdigest()Not urgent, but worth considering to keep static analysis and security reviews quiet.
tools/list_editable_packages.py (1)
75-129: Optional: surface per-executable failures for easier debugging
get_editable_packages()returningNoneon any subprocess/JSON error causesmain()to silently skip that interpreter. That keeps output clean, but can make it hard to understand why a particular python-* is missing.If you want more transparency, consider printing a brief message to stderr when an executable is skipped due to an error (perhaps behind a
--verboseflag), while still returning 0 to avoid breaking workflows.nemo_rl/utils/prefetch_venvs.py (3)
57-75: Container-only wrapper creation hook looks correct but may be a bit noisy on bare metalThe unconditional call to
create_frozen_environment_symlinks(venv_configs)and the earlyNRL_CONTAINERguard give a clear separation between container and non-container behavior, which is good.Minor nit: on non-container runs, this will always print the "Skipping frozen environment wrapper script creation (not in container)" message after prefetching. If
prefetch_venvs()is used in local/dev workflows, consider either:
- Guarding the call at the caller (only call when
NRL_CONTAINERis set), or- Making the "skipping" message debug/optional
to keep CLI output a bit quieter for non-container use cases.
81-105: Avoid re-running venv creation just to discover python pathsInside
create_frozen_environment_symlinks, you callcreate_local_venv(py_executable, actor_fqn)again for each actor, even thoughprefetch_venvs()has already created those venvs just above. Depending oncreate_local_venv’s implementation, this can:
- Re-run potentially expensive sync/bootstrap steps, and
- Duplicate log output, making prefetch runs noisier.
A more efficient pattern would be to reuse the
python_pathvalues from the initial prefetch loop instead of invokingcreate_local_venvagain, e.g.:
- Track a mapping
{actor_fqn: python_path}while prefetching, and- Pass that mapping into
create_frozen_environment_symlinksso it only needs to do collision detection and wrapper generation.This keeps the behavior the same while avoiding redundant work.
111-149: Wrapper script implementation is solid; consider aligning naming and tightening error handlingThe wrapper scripts themselves look good:
VIRTUAL_ENVandPATHare set in a predictable way.- The
exec "$VENV_PATH/bin/python" "$@"handoff keeps argument handling simple and correct.- Existing wrappers are replaced atomically via
unlink+ rewrite +chmod(0o755).Two small improvements you might consider:
The function is named
create_frozen_environment_symlinks, but it actually writes wrapper scripts rather than symlinks. Renaming to something likecreate_frozen_environment_wrappers(and updating callers) would better match behavior.The broad
except Exception as eblocks around venv discovery and wrapper creation are likely intentional “best-effort” behavior. If you want slightly stronger hygiene without losing that behavior, you could narrow them to anticipated exceptions (e.g.,OSError,subprocess.CalledProcessError) while still printing warnings and continuing.Neither of these is blocking; just potential polish for clarity and maintainability.
tests/unit/test_version_check.py (2)
87-229: Skipped mismatch/edge-case tests: either promote to integration tests or simplifyThe additional tests around mismatched fingerprints, ignore flags, missing fingerprint files, and runpy failures are all marked
@pytest.mark.skip(...). As a result they currently add complexity without contributing to coverage.Two options to consider:
- If you plan to rely on these behaviors, move them toward real integration tests (e.g., minimal reliance on heavy
Pathpatching, or running against an actual small fingerprint script and file).- If they’re no longer needed, converting them to parameterized variants of simpler unit tests or removing them would keep the suite leaner.
Not urgent, but clarifying the long-term intent for these skipped tests will help keep the test suite maintainable.
231-460: Integration-style build isolation tests are valuable; consider marking them explicitly as slowThe
TestBuildIsolationDetectiontests provide useful end-to-end coverage of_is_build_isolationby:
- Creating a temporary project and wiring it to the local nemo_rl tree.
- Spawning external build processes and asserting on the recorded
sys.prefixand isolation flags.You already guard against slow/broken environments by skipping on
TimeoutExpiredand generic initialization failures, which is good.Given they run real external processes and can take noticeable time, you might want to:
- Add a marker (e.g.,
@pytest.mark.slowor similar) so they can be selectively included/excluded in CI, and/or- Document in a module-level comment that these are integration tests, not pure unit tests.
Behavior-wise, though, the tests look sound and match the intended build-isolation semantics.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (15)
.github/workflows/cicd-main.yml(1 hunks)docker/Dockerfile(3 hunks)docs/conf.py(1 hunks)docs/design-docs/dependency-management.md(1 hunks)docs/index.md(1 hunks)nemo_rl/__init__.py(2 hunks)nemo_rl/utils/prefetch_venvs.py(2 hunks)pyproject.toml(1 hunks)tests/functional/L1_Functional_Tests_GPU.sh(1 hunks)tests/functional/grpo_frozen_env.sh(1 hunks)tests/functional/test_frozen_env.sh(1 hunks)tests/unit/test_version_check.py(1 hunks)tools/build-custom-vllm.sh(1 hunks)tools/generate_fingerprint.py(1 hunks)tools/list_editable_packages.py(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-12T14:46:55.513Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:16-30
Timestamp: 2025-10-12T14:46:55.513Z
Learning: In the NVIDIA-NeMo/RL repository, test scripts under tests/ follow a consistent pattern: use `cd $PROJECT_ROOT` without quotes or error handling, and pass arguments with `$@` unquoted. Maintain this consistency when adding new test scripts.
Applied to files:
tests/functional/grpo_frozen_env.shdocker/Dockerfiletests/functional/test_frozen_env.shtests/functional/L1_Functional_Tests_GPU.sh
📚 Learning: 2025-09-19T07:28:29.887Z
Learnt from: shuo-nvidia
Repo: NVIDIA-NeMo/RL PR: 1006
File: tests/test_suites/llm/distillation-qwen3-32b-to-4b-base-2n8g-fsdp2tp2-long.v1.sh:1-4
Timestamp: 2025-09-19T07:28:29.887Z
Learning: The NVIDIA-NeMo/RL project prefers to maintain consistent formatting across test scripts rather than applying individual bash hardening improvements like `set -euo pipefail` or proper quoting for sourcing files.
Applied to files:
tests/functional/grpo_frozen_env.shdocker/Dockerfile
📚 Learning: 2025-10-12T14:46:57.171Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1324
File: tests/test_suites/llm/distillation-qwen3-32b-to-1.7b-base-1n8g-megatron-tp2pp2cp2-pack.sh:6-11
Timestamp: 2025-10-12T14:46:57.171Z
Learning: Test scripts in tests/test_suites/llm/ follow a standard configuration pattern that includes NUM_NODES, STEPS_PER_RUN, MAX_STEPS, NUM_RUNS (calculated as `$(( (MAX_STEPS + STEPS_PER_RUN - 1) / STEPS_PER_RUN ))`), and NUM_MINUTES. These variables are part of the test infrastructure's standard interface and should not be flagged as unused even if not directly referenced within the individual script, as they are consumed by external launch tooling or common.env.
Applied to files:
tests/functional/L1_Functional_Tests_GPU.sh
🧬 Code graph analysis (3)
tests/functional/test_frozen_env.sh (2)
tools/generate_fingerprint.py (1)
main(132-135)tools/list_editable_packages.py (1)
main(132-163)
nemo_rl/utils/prefetch_venvs.py (1)
nemo_rl/utils/venvs.py (1)
create_local_venv(33-103)
tests/unit/test_version_check.py (1)
nemo_rl/__init__.py (1)
_check_container_fingerprint(72-204)
🪛 markdownlint-cli2 (0.18.1)
docs/design-docs/dependency-management.md
141-141: Blank line inside blockquote
(MD028, no-blanks-blockquote)
192-192: Blank line inside blockquote
(MD028, no-blanks-blockquote)
321-321: Blank line inside blockquote
(MD028, no-blanks-blockquote)
🪛 Ruff (0.14.5)
tools/generate_fingerprint.py
51-51: Probable use of insecure hash functions in hashlib: md5
(S324)
68-68: subprocess call: check for execution of untrusted input
(S603)
70-76: Starting a process with a partial executable path
(S607)
nemo_rl/utils/prefetch_venvs.py
95-100: Abstract raise to an inner function
(TRY301)
95-100: Avoid specifying long messages outside the exception class
(TRY003)
103-103: Do not catch blind exception: Exception
(BLE001)
139-139: Do not catch blind exception: Exception
(BLE001)
tools/list_editable_packages.py
57-57: subprocess call: check for execution of untrusted input
(S603)
85-85: subprocess call: check for execution of untrusted input
(S603)
100-100: subprocess call: check for execution of untrusted input
(S603)
122-122: Consider moving this statement to an else block
(TRY300)
tests/unit/test_version_check.py
57-57: Unused function argument: path
(ARG001)
57-57: Unused function argument: run_name
(ARG001)
124-124: Unused method argument: caplog
(ARG002)
135-135: Unused function argument: path
(ARG001)
135-135: Unused function argument: run_name
(ARG001)
174-174: Unused function argument: path
(ARG001)
174-174: Unused function argument: run_name
(ARG001)
208-208: Unused function argument: path
(ARG001)
208-208: Unused function argument: run_name
(ARG001)
209-209: Avoid specifying long messages outside the exception class
(TRY003)
316-316: Starting a process with a partial executable path
(S607)
325-325: Do not catch blind exception: Exception
(BLE001)
354-354: Starting a process with a partial executable path
(S607)
415-415: subprocess call: check for execution of untrusted input
(S603)
416-427: Starting a process with a partial executable path
(S607)
nemo_rl/__init__.py
162-182: Possible SQL injection vector through string-based query construction
(S608)
192-195: Abstract raise to an inner function
(TRY301)
202-202: Do not catch blind exception: Exception
(BLE001)
🪛 Shellcheck (0.11.0)
tests/functional/grpo_frozen_env.sh
[error] 38-38: Double quote array expansions to avoid re-splitting elements.
(SC2068)
tests/functional/test_frozen_env.sh
[warning] 19-19: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: sphinx-build / Build docs
- GitHub Check: build-container / main
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (13)
docs/conf.py (1)
69-69: Mermaid fence support looks correctEnabling
myst_fence_as_directive = ["mermaid"]is consistent with the existingsphinxcontrib.mermaidsetup and should make ```mermaid fences render cleanly..github/workflows/cicd-main.yml (1)
198-213: Build target and build-args wiring is consistentSwitching the container build
targettoreleaseand passingMAX_JOBSandNEMO_RL_COMMITas build-args matches the Dockerfile’sARGusage and should correctly surface commit metadata in the image.docs/index.md (1)
237-254: New dependency-management design doc is correctly linkedAdding
design-docs/dependency-management.mdto the Design Docs toctree looks correct and keeps navigation consistent.tools/build-custom-vllm.sh (1)
69-69: Torch backend bump aligns with CUDA 12.9 stackUsing
--torch-backend=cu129fortorch==2.8.0aligns this script with the cu129 index and CUDA 12.9 base image, reducing potential ABI mismatches.tests/functional/L1_Functional_Tests_GPU.sh (1)
22-25: Frozen-environment tests are well integrated into the GPU L1 suiteRunning
grpo_frozen_env.shandtest_frozen_env.shvia barebashbefore theuv run --no-synctests matches the design intent of validating frozen env behavior, and the coverage output path still aligns with the finalcoverage combine.tests/functional/grpo_frozen_env.sh (1)
1-45: Frozen GRPO functional test is structured correctlyThis script cleanly exercises the frozen environment via bare
python+ coverage, writes coverage totests/.coveragefor later aggregation, and enforces a concrete training-metric threshold, all while following the existing tests’ style conventions.docker/Dockerfile (1)
12-15: Container marker, venv seeding, and fingerprint generation are wired coherentlySetting
NRL_CONTAINER=1, seeding the uv venv before dependency resolution, copying justnemo_rl/__init__.pyandpackage_info.pyfor the custom vLLM path, and emitting/opt/nemo_rl_container_fingerprintviatools/generate_fingerprint.pyall line up with the new runtime fingerprint checks and frozen-env design. The placement of the fingerprint file outside/opt/nemo-rlis also a good choice to survive user mounts.Also applies to: 79-83, 86-88, 132-135
nemo_rl/__init__.py (1)
207-208: Import-time fingerprint enforcement is appropriate but be aware of import costRunning
_check_container_fingerprint()unconditionally at import (gated byNRL_CONTAINER) is a good safeguard but does add I/O and a git/subprocess-dependent code path to every import in containers.No change needed, but keep this in mind if import-time latency becomes an issue; you could later gate this behind a
NRL_DISABLE_VERSION_CHECKor lazily run the check only when starting Ray workers.tests/functional/test_frozen_env.sh (2)
64-139: Nice end-to-end mutation detection; minor note on network/submodule assumptionsThe Test 2 workflow (rsync snapshot, baseline import, pyproject mutation, optional submodule bump) is well thought out and exercises the fingerprint logic realistically. The submodule step gracefully degrades if
3rdparty/megatron-lm/.gitis missing orgit fetchfails.Just a small note: because this relies on network access and a specific submodule path, it’s good you already print warnings and skip rather than failing the whole test; keep that behavior if you later add more submodule checks.
Based on learnings
151-195: Import isolation assertions look good and match the designThe checks that
python-MegatronPolicyWorkeronly seesmegatron.core(notnemo_automodel) andpython-DTensorPolicyWorkerV2only seesnemo_automodel(notmegatron.core) are clear and directly encode the intended separation between worker environments.This is a solid functional guardrail for the frozen env setup.
Based on learnings
tools/generate_fingerprint.py (1)
58-103: Submodule SHA collection is robust and fail-open in a sensible wayThe
get_submodule_shas()helper correctly scopes git to the repo via--git-dir/--work-tree, parsesgit submodule status, and falls back to an empty dict on failure (e.g., missing git, not a repo). That meshes well with the import-time fingerprint check, which will still compare pyproject/uv.lock even if submodule info is unavailable.Looks good as-is.
tools/list_editable_packages.py (1)
29-72: PATH scanning and executable validation look solidThe discovery logic for python executables (regex filter +
--versionprobe with timeout) is a pragmatic way to find relevant interpreters while excluding argcomplete helpers. Sorting by name and deduping viaseengives deterministic output.This is a good balance between robustness and simplicity.
tests/unit/test_version_check.py (1)
27-86: Good coverage of basic fingerprint check behaviorThe first two tests in
TestContainerFingerprintChecknicely cover:
- The bare-metal guard (
NRL_CONTAINERunset → check is a no-op).- The happy-path container case where code and container fingerprints match, including mocking out both the fingerprint script and
/opt/nemo_rl_container_fingerprint.The mocking strategy (patching
runpy.run_pathandnemo_rl.Path) is a bit white-box but appropriate here given the logic you’re exercising, and the tests assert the main contract (no exception raised) without overcoupling to logging details.
Signed-off-by: Terry Kong <[email protected]>
Closes #1480
Dependency Management & Container Fingerprinting
Overview
This PR allows users to use bare
pythonfor the driver script and exposes all the other venvs and alias them in thePATHwith the namespython-$ray_ractor_name. This PR also adds container fingerprinting to detect and handle version mismatches between container environments and local code.Key Changes
1. Container Fingerprinting System
New files:
tools/generate_fingerprint.py: Computes MD5 hashes ofpyproject.toml,uv.lock, and git submodule SHAs to create a unique fingerprintnemo_rl/__init__.py: Enhanced with_check_container_fingerprint()that validates container/code alignment on startupHow it works:
/opt/nemo_rl_container_fingerprintgenerated at build timeNRL_CONTAINER=1), NeMo RL compares the container's fingerprint against the current codeEnvironment variables:
NRL_IGNORE_VERSION_MISMATCH=1: Bypass the version check (not recommended)NRL_FORCE_REBUILD_VENVS=true: Automatically skip the check and rebuild venvs with current dependencies (existing env var)NRL_CONTAINER=1: Indicates running inside a container (set automatically in Dockerfile)2. Frozen Environment Support
New features:
tools/list_editable_packages.py: Utility to discover which packages can be mounted for developmentnemo_rl/utils/prefetch_venvs.py: Now creates convenience symlinks likepython-MegatronPolicyWorker,python-VllmGenerationWorker, etc.Benefits:
uv runpython-MegatronPolicyWorker -m pip install <pkg>for quick experimentation/usr/local/bin/python-*python(instead ofuv run) skips network checks that uv performs, providing faster startup in air-gapped or network-constrained environments3. Comprehensive Documentation
New docs:
docs/design-docs/dependency-management.md:uv runworks under the hoodDocs enhancements:
sphinxcontrib-mermaidsupport for diagrams in documentationdocs/conf.pyto enable mermaid code blocks4. Docker Improvements
Dockerfile changes:
ENV NRL_CONTAINER=1to indicate container environmentnemo_rl/__init__.pyandpackage_info.pyearlier in build forbuild-custom-vllm.shuv venv --seedbefore conditional vLLM build/opt/nemo_rl_container_fingerprint5. Test Coverage
New tests (
tests/unit/test_version_check.py):uv lock/sync)uv synccommands6. Miscellaneous
pipto core dependencies for frozen environment support (sinceuv venv --seeddoesn't always add it)build-custom-vllm.sh(cu128 → cu129)Workflows Supported
Production Workflow
Development Workflow
NRL_FORCE_REBUILD_VENVS=trueto rebuild venvs on demandFrozen Environment Workflow
uv runpippyproject.toml)pythonwhich in very high node scales can be convenient.Migration Notes