Add day0-release orchestration skill with enforced gates by Edwardf0t1 · Pull Request #1596 · NVIDIA/Model-Optimizer

Edwardf0t1 · 2026-06-02T00:33:54Z

What does this PR do?

Type of change: new feature (Claude agent skill)

Adds a day0-release orchestration skill that chains the existing domain skills (ptq → deployment → evaluation → compare-results) into the day-0 release happy path, with code-enforced gates between stages.

The problem it solves. Today the day-0 release sequence is model-driven — the agent re-decides the PTQ → deploy → eval → compare order each session, and can silently skip a gate (e.g. report scores from an incomplete eval, or hand off a checkpoint that missed quantization coverage). These are failure modes we hit repeatedly in live trials. day0-release makes the sequence and the gates deterministic so the common case is repeatable.

What it is — and isn't. It's a conductor, not a new instrument. It owns the sequence, the gates, and the accept/retry decision; the domain skills still own the actual work. It is not for single-stage asks ("just quantize X" → ptq; "run MMLU on this endpoint" → evaluation) — its negative trigger excludes those. It fires only for the full goal-driven release.

Goal it drives toward (the documented day-0 criterion): a quantized checkpoint smaller than the original, with <1% accuracy drop on the standard set vs the matching baseline, plus a publish recommendation.

Contents:

.claude/skills/day0-release/SKILL.md — the chain + gate calls + decision logic.
.claude/skills/day0-release/scripts/gate_ptq.py, gate_run.py, gate_compare.py — deterministic gate scripts. Each is a pure decision function (evaluate_*) plus a thin file-reading main, returning JSON {pass, failure_class, detail, ...}. The failure_class values match the modelopt-agent-protocols strawman (MODEL_UNSUPPORTED, QUANT_COVERAGE_FAILURE, EVAL_JUDGE_FAILED, SAMPLE_ACCOUNTING_FAILED, …).
.claude/skills/day0-release/tests/test_gates.py — 21 unit tests for the gate decision functions (GPU-free, no cluster).
.claude/skills/day0-release/tests/evals.json — 5 routing + behavior assertions.

As-built note — gate_ptq.py input. v1 takes a --summary <validation.json> (size scan + hf_ptq quant summary: source_bytes, output_bytes, recipe, layer_precision_counts, metadata_diffs) rather than reading the safetensors checkpoint directly. The --checkpoint/--source/--recipe args are reserved stubs; wiring the gate to build that summary from the exported checkpoint is a follow-up. gate_run.py and gate_compare.py likewise read small JSON summaries the agent produces from the run artifacts.

Usage

Ask Claude Code:

Release `<org>/<model>` at day-0: quantize to NVFP4, validate accuracy is within
1% of the BF16 baseline on the AA suite, and tell me if it's publishable.
Run on <cluster>.

The skill then runs the chain, enforcing a gate after each stage:

setup ─▶ PTQ ─▶ deploy ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
          │       │            │                │              │
       gate_ptq  health     gate_run         gate_run      gate_compare

After stage	Gate	Pass condition	On fail
Setup	reachability	creds present, cluster SSH ok	SYSTEMIC → abort
PTQ	`gate_ptq.py`	size ratio <1, layer coverage matches recipe, metadata consistent	triage → PATCH / skip recipe / abort
Deploy	health	endpoint up + 1 successful generation	triage → PATCH flags/TP / skip
Each eval	`gate_run.py`	complete, all samples scored, no judge/parse failure	retry / triage
Compare	`gate_compare.py`	every task within <1% drop	REGRESSION → report; ANOMALOUS → human

It returns a decision, not a raw artifact: ACCEPT (publishable, with report) / REGRESSION (which tasks failed the threshold) / ANOMALOUS / INFEASIBLE — plus the workspace path and MLflow run IDs.

Testing

Tested in two layers — deterministic control flow (CI-able) separate from the agentic stages (integration-level):

Gate-script unit tests — tests/test_gates.py, 21 cases, all passing (python -m pytest .claude/skills/day0-release/tests/test_gates.py). Covers the pass path plus each failure_class branch for all three gates: gate_compare (ACCEPT / REGRESSION / ANOMALOUS-on-implausible-gain / ANOMALOUS-out-of-range / mismatched-task-sets / relative-threshold / non-numeric-score), gate_run (valid / dropped-samples / judge-error / missing-score / non-numeric-score / timeout-non-terminal / running-non-terminal / no-tasks), gate_ptq (pass / not-smaller / zero-coverage→MODEL_UNSUPPORTED / unexpected-unquantized / metadata-diff / unknown-recipe). No GPU or cluster needed.
Routing assertions (tests/evals.json, 5 cases): documents expected routing — fires on "release model X at day-0", does not fire on "just quantize X" / "run MMLU on this endpoint", and the gate-blocking / regression-reports-and-stops behaviors. These are behavior specs for manual QA; there is no automated skill-routing harness yet (tracked as "Remaining Work" in the design doc).
End-to-end integration (run on aws-pdx / B300, 2026-06-04) — ✅ the full chain and all five gate types were exercised on real cluster artifacts (Qwen3-0.6B). Results below.

End-to-end integration results

Every gate fired correctly on real PTQ / serve / eval outputs, and both failure-routing branches were exercised:

Stage	Gate	Real artifact	Verdict	✓
Setup	reachability	aws-pdx SSH + SLURM	PASS	✅
PTQ (happy)	`gate_ptq`	FP8 ckpt, 0.50×, FP8=196 layers, unexpected=0	`pass:true`	✅
PTQ (failure fixture)	`gate_ptq`	`nvfp4_experts_only` on a dense model → NVFP4=0, 196 silently-unquantized	`MODEL_UNSUPPORTED` → chain STOPS before deploy	✅
Deploy	health	2 vLLM endpoints, `/health`=200 + generation	PASS	✅
Eval ×2	`gate_run`	real run-summaries, 300/300 scored, SUCCESS	`pass:true`	✅
Compare	`gate_compare`	arc_easy BF16=57.0 vs FP8=54.0	`REGRESSION` (drop 3.0 > 1pt)	✅

Failure fixture (the key control-flow proof): an experts-only recipe on a dense model matched 0 modules, so PTQ "succeeded" in 3.5s and exported a checkpoint with quant_algo:null / quantized_layers:{} — every linear silently BF16. gate_ptq caught the zero-coverage and routed to MODEL_UNSUPPORTED, so the chain did not deploy/eval the bad checkpoint. This is exactly the silent coverage-miss the gate exists to stop.
Caveat — ACCEPT not reached on real data: the happy path returned REGRESSION, which is the correct verdict — Qwen3-0.6B FP8 genuinely loses ~3pt on arc_easy (real, ~2 SE at 300 samples; weight-only FP8 = 54.0 was no better than KV-FP8 = 54.67, so the KV-cache wasn't the cause). Tiny models are quantization-fragile and don't meet the <1% criterion; the gate correctly refused to rubber-stamp it. The threshold was not loosened to manufacture a pass. The ACCEPT terminal + publish-recommendation path therefore remains validated by unit test (test_compare_accept_within_threshold) only, not end-to-end — demonstrating it on real data needs a larger model (e.g. Qwen3-4B/8B) where FP8 is near-lossless, deferred as optional follow-up.
Infra bugs shaken out (incidental, not in this skill's code): (1) the ModelOpt launcher's /hf-local bind mount has no host dir on aws-pdx → set SLURM_HF_LOCAL=<lustre dir>; (2) lm-eval is unusable inside vllm/vllm-openai:cu130-nightly (ships transformers 5.x, which removed AutoModelForVision2Seq that lm-eval imports at load) → used a dependency-free /v1/completions log-likelihood scorer.

Scope

In v1: the linear chain + gate scripts + ACCEPT/fail-with-report outcomes. On REGRESSION, v1 reports "recipe R regressed on tasks [...]" and stops.

Deferred (follow-up PR): the evaluator-optimizer recipe loop (compare → pick next recipe → re-PTQ), which needs the bigpareto integration and the shared modelopt-agent-protocols schema adopted on both sides.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ (new skill; no change to existing skills)
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A (no new deps)
Did you write any new necessary tests?: ✅ (21 gate-script unit tests + 5 routing evals; end-to-end integration run on aws-pdx — see Testing)
Did you update Changelog?: ✅
Did you get Claude approval on this PR?: ⬜ (run /claude review before marking ready)

Additional Information

Design rationale: ModelOpt Agent Skills Design doc — this skill implements the "deterministic day-0 chain driver" (prompt-chaining-with-gates, the code-driven orchestration pattern from Anthropic's Building Effective Agents). The gate scripts double as the data source for the Observability stage metrics, and gate_compare.py's verdict is the entry point for the deferred evaluator-optimizer recipe loop.

Summary by CodeRabbit

New Features
- Added the day0-release skill: a linear, gated release driver that enforces checkpoints at post-quantization, deployment, evaluation, and baseline-vs-candidate comparison stages.
New Tools
- Included CLI gate utilities that validate PTQ checkpoints, run/evaluation outputs, and baseline-vs-candidate comparisons with clear ACCEPT/REGRESSION/ANOMALOUS outcomes and exit codes.
Tests
- Added unit tests covering gate behaviors, failure-class triage, and acceptance/regression/anomaly scenarios.
Documentation
- Updated changelog and added spec outlining inputs, gating rules, triage table, and required closeout outputs.

Adds a deterministic end-to-end driver that chains the existing domain skills (ptq -> deployment -> evaluation -> compare-results) with a gate after every stage, returning a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE) rather than a raw artifact. The skill is a conductor: it sequences the domain skills and enforces the gates; it does not re-implement quantization, serving, evaluation, or comparison. Negative trigger excludes single-stage requests. Ships three GPU-free, unit-tested gate scripts whose pure decision functions mirror logic already specified in the domain skills: - gate_ptq.py -> checkpoint-validation.md (size, coverage, metadata) - gate_run.py -> run-validation.md (completeness, judge/sample checks) - gate_compare.py -> compare-results threshold decision 18 unit tests cover pass + each failure_class branch; routing tests in tests/evals.json assert the skill fires on full-release requests and not on single-stage ones. v1 scope: linear chain + gates + report. On REGRESSION it reports which tasks exceeded the threshold and stops. The evaluator-optimizer recipe loop (compare -> next recipe -> re-PTQ) is deferred to a follow-up that needs the bigpareto integration and a shared config/result schema. Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

copy-pr-bot · 2026-06-02T00:33:57Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-02T00:34:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 45c0740c-c049-4b59-ae3e-de3eb23b16e9

📥 Commits

Reviewing files that changed from the base of the PR and between fe6f2bc and 550fa1d.

📒 Files selected for processing (2)

.claude/skills/day0-release/scripts/gate_compare.py
.claude/skills/day0-release/scripts/gate_run.py

🚧 Files skipped from review as they are similar to previous changes (2)

.claude/skills/day0-release/scripts/gate_compare.py
.claude/skills/day0-release/scripts/gate_run.py

📝 Walkthrough

Walkthrough

Introduces a day0-release skill—an end-to-end quantization release workflow with enforced per-stage gates. Defines PTQ, deployment, baseline evaluation, quantized evaluation, and comparison stages; implements three deterministic gate scripts validating checkpoint coverage, evaluation completeness, and accuracy thresholds; includes comprehensive unit tests and eval scenarios.

Changes

Day-0 Release Workflow and Gates

Layer / File(s)	Summary
Workflow specification and gating contract `.claude/skills/day0-release/SKILL.md`	Skill document defining six-stage workflow (PTQ → deploy → baseline eval → quantized eval → compare → publish) with per-stage gating behavior, failure classification, triage actions mapping, output contract, and v1 scope (linear gated workflow, stop-on-REGRESSION, deferred recipe re-optimization).
Compare gate for accuracy threshold validation `.claude/skills/day0-release/scripts/gate_compare.py`	Implements decision logic comparing per-task baseline and candidate scores with support for absolute/relative thresholds, implausible-gain detection, and per-task classification. Returns ACCEPT/REGRESSION/ANOMALOUS with per-task detail payload and structured CLI that handles JSON file I/O and error reporting.
PTQ gate for checkpoint export validation `.claude/skills/day0-release/scripts/gate_ptq.py`	Validates post-quantization checkpoint validation-summary JSON: enforces output smaller than source, validates quantized-layer coverage against recipe-expected precision buckets, flags metadata diffs. Returns deterministic failure classification with per-layer detail and CLI for JSON-based validation.
Run gate for evaluation completeness validation `.claude/skills/day0-release/scripts/gate_run.py`	Validates evaluation run-summary JSON per-task for terminal SUCCESS status, error classification (judge/auth vs infra), sample accounting match, and canonical numeric score. Aggregates pass/failure_class verdict with per-task reasoning and CLI for JSON input/output.
Unit tests for gate evaluation logic `.claude/skills/day0-release/tests/test_gates.py`	Deterministic pytest module with tests covering acceptance/rejection/anomaly decisions, failure-class mappings, malformed input handling, and edge cases across all three evaluators. Includes helper constructors for task and checkpoint payloads and direct pytest entrypoint.
Integration test scenarios and version documentation `.claude/skills/day0-release/tests/evals.json`, `CHANGELOG.rst`	Eval test scenarios asserting skill routing, stage sequencing, PTQ gating with failure-class branching, and compare-based regression classification. CHANGELOG.rst documents the new Early Testing skill with deterministic gated workflow and unit-tested gate scripts.
Ruff per-file ignore configuration `pyproject.toml`	Adds per-file-ignores entry for `.claude/skills//tests/test_.py` to exclude docstring (D) and import-position (E402) linting rules.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

mxinO
shengliangxu
kaix-nv

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 12.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely summarizes the main change: adding a day0-release orchestration skill with enforced gates, which is the core objective of this PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No security anti-patterns found. Gate scripts use only standard library with safe json.load(). No torch.load, numpy.load, trust_remote_code, eval/exec, or # nosec patterns detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch day0-release

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-02T00:46:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.60%. Comparing base (905259f) to head (550fa1d).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1596      +/-   ##
==========================================
- Coverage   77.38%   76.60%   -0.79%     
==========================================
  Files         479      482       +3     
  Lines       52435    54783    +2348     
==========================================
+ Hits        40578    41966    +1388     
- Misses      11857    12817     +960

Flag	Coverage Δ
examples	`41.67% <ø> (+0.85%)`	⬆️
gpu	`59.85% <ø> (-0.59%)`	⬇️
regression	`15.20% <ø> (+0.07%)`	⬆️
unit	`53.93% <ø> (+0.30%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

code-quality CI flagged D103 (missing docstrings) on the gate-script main() functions and the test functions, since .claude/skills/**/scripts isn't covered by the existing tests/* and examples/* docstring exemptions. - Add one-line docstrings to gate_ptq/gate_run/gate_compare main() (the gate scripts stay fully pydocstyle-compliant — they're the contract). - Exempt skill test scripts from D / E402 via a per-file-ignore mirroring the existing tests/* rule (.claude/skills/*/scripts/test_*.py). - Apply ruff format. 18 gate unit tests still pass. Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/day0-release/scripts/gate_compare.py:
- Around line 83-94: The loop currently logs anomalies for
non-numeric/out-of-range scores but still computes arithmetic (drop = b - c / b)
which raises TypeError for invalid b/c; after the validation block that appends
to anomalies for ("baseline", b) / ("candidate", c) add a short-circuit that
skips the delta/gain math when either score is invalid (e.g., check if not
isinstance(b, (int,float)) or not isinstance(c, (int,float)) or if anomalies was
appended for this task) and mark the task as anomalous/continue the loop so you
do not execute drop/limit/within calculations; update references to drop, limit,
within, and the implausible gain check to only run when b and c are valid
numbers within [_SCORE_MIN, _SCORE_MAX].

In @.claude/skills/day0-release/scripts/gate_ptq.py:
- Around line 137-149: main() currently hard-fails if args.summary is not
provided, which breaks the documented Step 2 flow that invokes the script with
--checkpoint/--source/--recipe; change the validation so that when args.summary
is missing the code will either (a) accept and derive or locate the
validation-summary.json from args.checkpoint or args.source (use the same path
resolution logic as the export/checkpoint step) or (b) fall back to using
args.recipe/args.source to reconstruct the summary before proceeding, instead of
returning exit code 2; update the branch handling around p.add_argument(...) and
the args.summary check in main() (and any helper that reads the summary) to
implement this fallback behavior so the script runs with the documented
invocation.

In @.claude/skills/day0-release/scripts/gate_run.py:
- Around line 91-97: The sample-count and score validation currently allows
missing/invalid counts or non-finite scores to pass; update the checks around
the sample accounting block (the expected/scored handling) and the score check
so that if expected is None or scored is None it sets ok = False and appends a
reason (e.g., "missing expected/scored samples"), and if score is not a finite
number (use math.isfinite after converting/parsing or handle
ValueError/TypeError) then set ok = False and append a descriptive reason (e.g.,
"invalid/non-finite score"); ensure you import math and use math.isfinite(score)
(or equivalent parsing+check) and keep the existing messages like "sample
accounting: scored {scored} of {expected}" and "no score extracted" but expand
them to cover missing/invalid cases so malformed eval outputs fail the gate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 18ea1284-1a4f-443e-ba5f-f2606c8478c4

📥 Commits

Reviewing files that changed from the base of the PR and between 905259f and 504cb72.

📒 Files selected for processing (7)

.claude/skills/day0-release/SKILL.md
.claude/skills/day0-release/scripts/gate_compare.py
.claude/skills/day0-release/scripts/gate_ptq.py
.claude/skills/day0-release/scripts/gate_run.py
.claude/skills/day0-release/scripts/test_gates.py
.claude/skills/day0-release/tests/evals.json
CHANGELOG.rst

coderabbitai · 2026-06-02T07:15:36Z

+    p.add_argument("--summary", help="validation-summary JSON (see module docstring)")
+    p.add_argument("--checkpoint", help="(reserved) checkpoint dir; v1 expects --summary")
+    p.add_argument("--source", help="(reserved) source model id/path")
+    p.add_argument("--recipe", help="(reserved) qformat; overrides summary.recipe if given")
+    args = p.parse_args(argv)
+
+    if not args.summary:
+        print(json.dumps({
+            "pass": False, "failure_class": "USER_CONFIG_ERROR",
+            "detail": "v1 requires --summary <validation-summary.json>; "
+                      "produce it from the exported checkpoint (size scan + hf_ptq quant summary)",
+        }))
+        return 2


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align the PTQ gate CLI with the documented Step 2 invocation.

main() hard-fails unless --summary is passed, but the published workflow calls this script with only --checkpoint/--source/--recipe. As shipped, the documented day-0 flow cannot get past the PTQ gate even when PTQ succeeds. Either support the Step 2 arguments here or update the workflow to pass --summary consistently.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/day0-release/scripts/gate_ptq.py around lines 137 - 149, main() currently hard-fails if args.summary is not provided, which breaks the documented Step 2 flow that invokes the script with --checkpoint/--source/--recipe; change the validation so that when args.summary is missing the code will either (a) accept and derive or locate the validation-summary.json from args.checkpoint or args.source (use the same path resolution logic as the export/checkpoint step) or (b) fall back to using args.recipe/args.source to reconstruct the summary before proceeding, instead of returning exit code 2; update the branch handling around p.add_argument(...) and the args.summary check in main() (and any helper that reads the summary) to implement this fallback behavior so the script runs with the documented invocation.

coderabbitai · 2026-06-02T07:15:36Z

+        if expected is not None and scored is not None and scored != expected:
+            ok = False
+            reasons.append(f"sample accounting: scored {scored} of {expected}")
+
+        if score is None:
+            ok = False
+            reasons.append("no score extracted")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the gate when sample counts or the canonical score are malformed.

A task with status="SUCCESS", no expected_samples/scored_samples, and a non-numeric or non-finite score currently passes. That lets incomplete or malformed eval outputs through to comparison. Treat missing/invalid counts and non-finite scores as gate failures here.

Suggested fix

+import math + ... - if expected is not None and scored is not None and scored != expected: - ok = False - reasons.append(f"sample accounting: scored {scored} of {expected}") + if not isinstance(expected, int) or expected < 0 or not isinstance(scored, int) or scored < 0: + ok = False + reasons.append("sample accounting missing/invalid") + elif scored != expected: + ok = False + reasons.append(f"sample accounting: scored {scored} of {expected}") - if score is None: + if not isinstance(score, (int, float)) or not math.isfinite(score): ok = False - reasons.append("no score extracted") + reasons.append("no valid score extracted")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if expected is not None and scored is not None and scored != expected:

ok = False

reasons.append(f"sample accounting: scored {scored} of {expected}")

if score is None:

ok = False

reasons.append("no score extracted")

if not isinstance(expected, int) or expected < 0 or not isinstance(scored, int) or scored < 0:

ok = False

reasons.append("sample accounting missing/invalid")

elif scored != expected:

ok = False

reasons.append(f"sample accounting: scored {scored} of {expected}")

if not isinstance(score, (int, float)) or not math.isfinite(score):

ok = False

reasons.append("no valid score extracted")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.claude/skills/day0-release/scripts/gate_run.py around lines 91 - 97, The sample-count and score validation currently allows missing/invalid counts or non-finite scores to pass; update the checks around the sample accounting block (the expected/scored handling) and the score check so that if expected is None or scored is None it sets ok = False and appends a reason (e.g., "missing expected/scored samples"), and if score is not a finite number (use math.isfinite after converting/parsing or handle ValueError/TypeError) then set ok = False and append a descriptive reason (e.g., "invalid/non-finite score"); ensure you import math and use math.isfinite(score) (or equivalent parsing+check) and keep the existing messages like "sample accounting: scored {scored} of {expected}" and "no score extracted" but expand them to cover missing/invalid cases so malformed eval outputs fail the gate.

coderabbitai

♻️ Duplicate comments (1)

.claude/skills/day0-release/scripts/gate_run.py (1)

91-97: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the gate when sample counts or the canonical score are malformed.

The validation gaps flagged in the previous review remain unaddressed. A task with status="SUCCESS", missing expected_samples/scored_samples (None), and a non-finite score (NaN/Inf) currently passes the gate. This allows incomplete or malformed eval outputs through to comparison, which will produce misleading results or downstream failures.

Missing validations:

Lines 91-93: If expected or scored is None, the check is skipped entirely
Lines 91-93: No type or non-negative validation for sample counts
Lines 95-97: If score is not None but is NaN or Inf, it passes

The fix requires importing math and adding explicit checks for missing/invalid counts and non-finite scores.

🛡️ Suggested fix

+import math
+
 ...
-        if expected is not None and scored is not None and scored != expected:
+        if not isinstance(expected, int) or expected < 0 or not isinstance(scored, int) or scored < 0:
             ok = False
-            reasons.append(f"sample accounting: scored {scored} of {expected}")
+            reasons.append("sample accounting missing/invalid")
+        elif scored != expected:
+            ok = False
+            reasons.append(f"sample accounting: scored {scored} of {expected}")
 
-        if score is None:
+        if not isinstance(score, (int, float)) or not math.isfinite(score):
             ok = False
-            reasons.append("no score extracted")
+            reasons.append("no valid score extracted")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/day0-release/scripts/gate_run.py around lines 91 - 97, The
current validation around sample counts and canonical score is too permissive;
update the block that inspects expected, scored, and score (variables expected,
scored, score; function logic around lines shown) to: import math at the top,
explicitly treat expected or scored being None as a failure (set ok=False and
append a clear reason), validate that expected and scored are integers and >= 0
(fail and append reason if not), still check for mismatch when both valid, and
validate score with math.isfinite(score) so NaN/Inf are treated as invalid (if
score is None or not finite set ok=False and append "no score extracted" or
"non-finite score" as appropriate). Ensure all failures set ok=False and push a
descriptive message into reasons.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In @.claude/skills/day0-release/scripts/gate_run.py:
- Around line 91-97: The current validation around sample counts and canonical
score is too permissive; update the block that inspects expected, scored, and
score (variables expected, scored, score; function logic around lines shown) to:
import math at the top, explicitly treat expected or scored being None as a
failure (set ok=False and append a clear reason), validate that expected and
scored are integers and >= 0 (fail and append reason if not), still check for
mismatch when both valid, and validate score with math.isfinite(score) so
NaN/Inf are treated as invalid (if score is None or not finite set ok=False and
append "no score extracted" or "non-finite score" as appropriate). Ensure all
failures set ok=False and push a descriptive message into reasons.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bc70c917-a62d-4c25-be89-6d7e5b7e8206

📥 Commits

Reviewing files that changed from the base of the PR and between 504cb72 and fcdb6a4.

📒 Files selected for processing (5)

.claude/skills/day0-release/scripts/gate_compare.py
.claude/skills/day0-release/scripts/gate_ptq.py
.claude/skills/day0-release/scripts/gate_run.py
.claude/skills/day0-release/scripts/test_gates.py
pyproject.toml

✅ Files skipped from review due to trivial changes (1)

pyproject.toml

🚧 Files skipped from review as they are similar to previous changes (3)

.claude/skills/day0-release/scripts/gate_compare.py
.claude/skills/day0-release/scripts/gate_ptq.py
.claude/skills/day0-release/scripts/test_gates.py

Copilot

Pull request overview

Adds a new day0-release Claude skill that deterministically orchestrates the “day-0 release” happy path (PTQ → deploy → baseline eval → candidate eval → compare) and enforces code-driven gates between stages to prevent silently advancing with invalid artifacts/runs.

Changes:

Introduces day0-release skill spec (SKILL.md) describing the fixed stage chain, gating points, and decision outputs.
Adds three gate scripts (gate_ptq.py, gate_run.py, gate_compare.py) plus unit tests (test_gates.py) and routing/behavior assertions (tests/evals.json).
Updates lint config (pyproject.toml) for skill test scripts and documents the feature in CHANGELOG.rst.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`pyproject.toml`	Adds Ruff per-file ignores for skill-local test scripts.
`CHANGELOG.rst`	Documents the new `[Early Testing]` day0-release skill and gates.
`.claude/skills/day0-release/tests/evals.json`	Adds routing/behavior assertions for manual QA of skill selection and stop-on-gate behavior.
`.claude/skills/day0-release/SKILL.md`	Defines the deterministic orchestration chain, gate invocation points, and triage/decision mapping.
`.claude/skills/day0-release/scripts/test_gates.py`	Adds GPU-free unit tests covering the three gate decision functions.
`.claude/skills/day0-release/scripts/gate_run.py`	Implements evaluation-run completeness/validity gate (sample accounting, status, judge errors).
`.claude/skills/day0-release/scripts/gate_ptq.py`	Implements PTQ output gate (size reduction, coverage, metadata diffs) based on a summary JSON.
`.claude/skills/day0-release/scripts/gate_compare.py`	Implements baseline-vs-candidate threshold gate producing ACCEPT/REGRESSION/ANOMALOUS decisions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+```bash
+python .claude/skills/day0-release/scripts/gate_ptq.py \
+    --checkpoint <output_path> --source <source_model> --recipe <qformat>
+```


+```bash
+python .claude/skills/day0-release/scripts/gate_run.py --run <run_dir_or_results_yml>
+```


+    for task in sorted(baseline):
+        b, c = baseline[task], candidate[task]
+        for label, val in (("baseline", b), ("candidate", c)):
+            if not isinstance(val, (int, float)) or not (_SCORE_MIN <= val <= _SCORE_MAX):
+                anomalies.append(f"{task}: {label} score {val!r} outside [0, 100]")
+        if relative:
+            drop = (b - c) / b if b else 0.0
+            limit = threshold
+        else:
+            drop = b - c  # percentage points
+            limit = threshold * 100.0  # threshold is a fraction of the 0-100 scale
+        within = drop <= limit


+A walltime TIMEOUT with a healthy resume is not a failure (NEL resumes); only a
+terminal FAILED/incomplete run fails the gate.


+_TERMINAL_OK = "SUCCESS"
+_RESUMABLE = {"TIMEOUT", "RESUMING"}  # NEL dependency-chain resume — not a failure
+


kevalmorabia97 · 2026-06-02T18:02:50Z

why is this not in the .claude/skills/day0-release/tests/ directory?

Good catch — moved test_gates.py into .claude/skills/day0-release/tests/ (alongside evals.json) via git mv, fixed its sys.path to point at ../scripts, and updated the ruff per-file-ignore glob to .claude/skills/*/tests/test_*.py. All 21 tests pass.

are the tests small and self scoped? if yes, we should include them in the unit test CI/CD

- gate_compare: short-circuit non-numeric/non-finite/out-of-range scores before the delta math — a payload like {'gpqa': '42'} now returns ANOMALOUS instead of crashing with TypeError (CodeRabbit, Copilot). - gate_run: fail on non-numeric/non-finite scores (not just None); treat RUNNING/PENDING as non-terminal -> INFRA_TRANSIENT (was UNKNOWN); fix the docstring's TIMEOUT 'not a failure' wording to 'non-terminal, transient, still blocks the gate' (CodeRabbit, Copilot). - SKILL.md: align Step 2 PTQ-gate invocation with the v1 CLI (--summary, not the reserved --checkpoint/--source/--recipe) so the documented flow can pass; Step 5 --run placeholder -> <run-summary.json> (was <run_dir_or_results_yml>). - Move test_gates.py into tests/ (kevalmorabia97); fix sys.path + ruff glob. - pyproject: correct the per-file-ignore comment/glob for skill test scripts. - Add regression tests: non-numeric scores (compare+run), RUNNING transient. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Edwardf0t1 · 2026-06-04T05:29:05Z

Pushed fe6f2bc4da addressing the review feedback (all 21 gate unit tests pass, ruff clean):

Correctness (CodeRabbit + Copilot consensus)

gate_compare.py: short-circuit non-numeric / non-finite / out-of-range scores before the delta math — a payload like {"gpqa": "42"} now returns ANOMALOUS instead of crashing with TypeError.
gate_ptq.py ↔ docs: aligned Step 2 to the v1 CLI (--summary <validation-summary.json>, not the reserved --checkpoint/--source/--recipe) so the documented day-0 flow can actually pass the PTQ gate.

gate_run.py robustness

Fail on non-numeric/non-finite scores (not just None).
Treat RUNNING/PENDING as non-terminal → INFRA_TRANSIENT (was misrouted to UNKNOWN).
Fixed the docstring's TIMEOUT "not a failure" wording → "non-terminal/transient, but still blocks the gate until the run finishes."

Docs / cleanup

SKILL.md Step 5 --run placeholder → <run-summary.json> (was <run_dir_or_results_yml>).
Moved test_gates.py → tests/ (@kevalmorabia97); fixed sys.path + the ruff per-file-ignore glob/comment.
Added regression tests: non-numeric scores (compare + run), RUNNING transient.

coderabbitai

🧹 Nitpick comments (1)

.claude/skills/day0-release/scripts/gate_compare.py (1)

35-42: 💤 Low value

Consider adding a docstring to document the validation contract.

The _is_valid_score function is well-named and the logic is clear, but a brief docstring would make the validation contract explicit for future maintainers (especially the boolean exclusion rationale).

📝 Suggested docstring

 def _is_valid_score(val):
-    """True only for a finite real number in [_SCORE_MIN, _SCORE_MAX] (not bool)."""
+    """Return True if val is a valid score.
+    
+    A valid score must be a finite numeric value (int or float, but not bool)
+    in the range [_SCORE_MIN, _SCORE_MAX]. Rejects NaN, infinity, and booleans
+    (bool is a subclass of int in Python, so explicit exclusion is needed).
+    """

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/day0-release/scripts/gate_compare.py around lines 35 - 42,
Add a concise docstring to the _is_valid_score function that documents its
validation contract: it should state the function returns True only for finite
real numbers within the [_SCORE_MIN, _SCORE_MAX] range and explicitly note that
booleans are excluded (because isinstance(True, int) is True), and mention any
edge-case behavior (e.g., rejects NaN/inf). Place this docstring immediately
below the def _is_valid_score(...) declaration so future maintainers understand
the intent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In @.claude/skills/day0-release/scripts/gate_compare.py:
- Around line 35-42: Add a concise docstring to the _is_valid_score function
that documents its validation contract: it should state the function returns
True only for finite real numbers within the [_SCORE_MIN, _SCORE_MAX] range and
explicitly note that booleans are excluded (because isinstance(True, int) is
True), and mention any edge-case behavior (e.g., rejects NaN/inf). Place this
docstring immediately below the def _is_valid_score(...) declaration so future
maintainers understand the intent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 78925974-71a6-4f69-be1b-af16b6452f33

📥 Commits

Reviewing files that changed from the base of the PR and between fcdb6a4 and fe6f2bc.

📒 Files selected for processing (5)

.claude/skills/day0-release/SKILL.md
.claude/skills/day0-release/scripts/gate_compare.py
.claude/skills/day0-release/scripts/gate_run.py
.claude/skills/day0-release/tests/test_gates.py
pyproject.toml

✅ Files skipped from review due to trivial changes (2)

pyproject.toml
.claude/skills/day0-release/SKILL.md

🚧 Files skipped from review as they are similar to previous changes (1)

.claude/skills/day0-release/scripts/gate_run.py

Wrap the long elif in gate_run.py and the per_task dict literal in gate_compare.py to satisfy pre-commit ruff-format (line-length 100). Fixes the code-quality CI failure on PR #1596; the earlier review fix ran `ruff check` (lint) but not the formatter. No logic change; all 21 gate unit tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

mxinO · 2026-06-05T04:47:23Z

+            limit = threshold
+        else:
+            drop = b - c  # percentage points
+            limit = threshold * 100.0  # threshold is a fraction of the 0-100 scale


This assumes every task score is on a 0-100 scale, but the default AA suite includes tau2_bench_telecom with Result (0-1). With baseline=1.0, candidate=0.0, threshold=0.01, this gate accepts because limit becomes 1.0.

mxinO · 2026-06-05T04:50:04Z

+  {
+    "name": "full-day0-release-triggers",
+    "skills": ["day0-release"],
+    "query": "Release nvidia/Some-New-Model at day-0: quantize to NVFP4, validate it's within 1% of the BF16 baseline on the AA suite, and tell me if it's publishable. Run on my cluster.",


Is this a real test? there is no a real model target here.

mxinO · 2026-06-05T04:54:06Z

+branch on `failure_class` (see **Triage** below). Do not deploy an
+unvalidated checkpoint.
+
+### Step 3 — Deploy


The deploy is redundant here, because the eval skill will include a deployment. We can go to directly eval I think.

Edwardf0t1 marked this pull request as ready for review June 2, 2026 07:07

Edwardf0t1 requested a review from a team as a code owner June 2, 2026 07:12

Edwardf0t1 requested a review from kevalmorabia97 June 2, 2026 07:12

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Edwardf0t1 requested review from chadvoegele, Copilot, kaix-nv, meenchen, mxinO and shengliangxu June 2, 2026 07:23

Copilot started reviewing on behalf of Edwardf0t1 June 2, 2026 07:24 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

kevalmorabia97 reviewed Jun 2, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

Edwardf0t1 changed the title ~~[Early Testing] Add day0-release orchestration skill with enforced gates~~ Add day0-release orchestration skill with enforced gates Jun 4, 2026

mxinO reviewed Jun 5, 2026

View reviewed changes

		A walltime TIMEOUT with a healthy resume is not a failure (NEL resumes); only a
		terminal FAILED/incomplete run fails the gate.

		_TERMINAL_OK = "SUCCESS"
		_RESUMABLE = {"TIMEOUT", "RESUMING"} # NEL dependency-chain resume — not a failure

Conversation

Edwardf0t1 commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

End-to-end integration results

Scope

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

codecov Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

kevalmorabia97 Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

shengliangxu Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Edwardf0t1 commented Jun 4, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

mxinO Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

mxinO Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

mxinO Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Edwardf0t1 commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

codecov Bot commented Jun 2, 2026 •

edited

Loading