diff --git a/.agents/skills/day0-release/SKILL.md b/.agents/skills/day0-release/SKILL.md new file mode 100644 index 00000000000..5fb2e2714db --- /dev/null +++ b/.agents/skills/day0-release/SKILL.md @@ -0,0 +1,177 @@ +--- +name: day0-release +description: Deterministic end-to-end driver for day-0 quantized-checkpoint releases — chains PTQ → evaluation → comparison with enforced gates between stages (the evaluation stage deploys the checkpoint itself), and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Use when the user asks to "release a model at day-0", "quantize and validate model X is within N% of baseline and tell me if it's publishable", or "run the full day-0 workflow". Do NOT use for single-stage requests — quantizing only (use ptq), serving only (use deployment), evaluating only (use evaluation), or comparing two existing runs (use compare-results). +license: Apache-2.0 +--- + +# Day-0 Release + +Drive a model from a pretrained checkpoint to a publish decision for a quantized +checkpoint, in a fixed sequence with a gate after every stage. This skill is a +**conductor**: it sequences the existing domain skills and enforces the gates — +it does not re-implement quantization, serving, evaluation, or comparison. + +**Goal (the default day-0 criterion):** a quantized checkpoint smaller than the +source, with accuracy drop within the threshold (default <1%) on the standard +benchmark set versus the matching baseline, plus a publish recommendation. + +## When to use + +Use only for the full goal-driven release. For a single stage, route to the +domain skill directly: quantize → **ptq**, serve → **deployment**, evaluate → +**evaluation**, compare two existing runs → **compare-results**. + +## Inputs + +Resolve these before starting (ask the user for anything missing): + +- **Model** — HF handle or checkpoint path. +- **Recipe / qformat** — e.g. `nvfp4`, `fp8`, or a recipe path. One candidate for v1. 
+- **Cluster / launcher** — from `clusters.yaml` (see `skills/common/environment-setup.md`). +- **Eval set** — defaults to the AA suite (`evaluation/recipes/tasks/aa/`). +- **Threshold** — max accuracy drop; default `0.01` (1%). + +## The chain + +```text +setup ─▶ PTQ ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout + │ │ │ │ + gate_ptq gate_run gate_run gate_compare +``` + +The **evaluation** skill deploys the model it evaluates (it stands up its own +endpoint per run), so there is no separate deploy stage — a serving failure +surfaces through the eval stage's gate (`DEPLOYMENT_HEALTH_FAILED`) and triages +to the **deployment** skill to debug serving in isolation (see Step 4). + +Run each stage by invoking the domain skill, then run its gate before +proceeding. **Do not advance past a failed gate.** Copy this checklist and track +progress: + +```text +- [ ] Step 0: Resolve inputs; confirm threshold and eval set +- [ ] Step 1: Setup gate — creds present, cluster reachable +- [ ] Step 2: PTQ (ptq skill) → gate_ptq.py +- [ ] Step 3: Baseline eval (evaluation skill, deploys source) → gate_run.py [skip if cached, see below] +- [ ] Step 4: Quantized eval (evaluation skill, deploys candidate) → gate_run.py +- [ ] Step 5: Compare (compare-results skill) → gate_compare.py → decision +- [ ] Step 6: Closeout — report + publish recommendation +``` + +### Step 1 — Setup gate + +Confirm credentials (`skills/common/credentials.md`) and cluster reachability +(`skills/common/remote-execution.md`). If either fails, stop with +`SYSTEMIC` — do not start PTQ. + +### Step 2 — PTQ + +Invoke the **ptq** skill to produce the quantized checkpoint. Then gate: + +```bash +# The ptq skill's post-PTQ validation produces a validation-summary JSON (size +# ratio + layer-precision counts + metadata diffs; see +# ptq/references/checkpoint-validation.md). 
v1 gates on that summary: +python .agents/skills/day0-release/scripts/gate_ptq.py --summary +# add `--recipe ` to override the recipe recorded in the summary +``` + +`gate_ptq.py` returns JSON `{pass, failure_class, detail}`. On `pass: false`, +branch on `failure_class` (see **Triage** below). Do not evaluate an +unvalidated checkpoint. + +### Step 3 — Baseline eval + +The baseline is the **source** (pre-quantization) model on the same task set and +sampling params. **Look it up first** — if a matching baseline run already +exists in MLflow (same model, task set, sampling params), reuse it and skip this +stage. Otherwise run it via the **evaluation** skill (which deploys the source +model itself). Gate with `gate_run.py`. + +### Step 4 — Quantized eval + +Invoke the **evaluation** skill on the quantized checkpoint, matching the +baseline's task set and sampling params. The evaluation skill stands up the +serving endpoint itself (it builds the `deployment.command`, e.g. a +`vllm serve …`), so a serving failure surfaces here as a failed `gate_run.py` +with `DEPLOYMENT_HEALTH_FAILED`. When that happens, **drop to the deployment +skill** to reproduce and debug serving in isolation (serve the checkpoint +standalone, confirm `/health` + one generation, iterate on flags / TP / image / +env vars) rather than burning full eval cycles on a broken endpoint — then carry +the working command back into NEL's `deployment.command` and resume the eval. If +the checkpoint genuinely can't serve, `POINT_INFEASIBLE`. Gate: + +```bash +python .agents/skills/day0-release/scripts/gate_run.py --run +``` + +A `pass: false` here means the run is incomplete or invalid (judge/parse error, +dropped samples) — do **not** compare scores from it. 
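
As a minimal sketch of how a conductor might branch on a gate verdict, assuming the verdict is the JSON object with `pass` and `failure_class` that the gates print: the `next_action` helper and the action strings below are hypothetical and are not part of the skill's scripts. The failure classes are the ones listed under Triage.

```python
import json

# Hypothetical mapping from a gate's failure_class to the conductor's next
# action. The action strings are illustrative; the classes come from Triage.
_NEXT_ACTION = {
    "INFRA_TRANSIENT": "retry_stage_once",
    "DEPLOYMENT_HEALTH_FAILED": "debug_with_deployment_skill",
    "EVAL_JUDGE_FAILED": "wait_and_retry",
    "SAMPLE_ACCOUNTING_FAILED": "investigate_samples",
    "USER_CONFIG_ERROR": "ask_user",
}


def next_action(verdict_json: str) -> str:
    """Return the next action for a gate verdict given as a JSON string."""
    verdict = json.loads(verdict_json)
    if verdict.get("pass") is True:
        return "advance"
    # Missing or unrecognized classes go to a human, the same as UNKNOWN.
    return _NEXT_ACTION.get(verdict.get("failure_class"), "needs_human")


if __name__ == "__main__":
    sample = json.dumps({"pass": False, "failure_class": "EVAL_JUDGE_FAILED"})
    print(next_action(sample))
```

Defaulting to `needs_human` keeps the sketch conservative: a verdict the conductor does not recognize never advances the chain.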
+ +### Step 5 — Compare + +Invoke the **compare-results** skill to produce per-task deltas, then gate: + +```bash +python .agents/skills/day0-release/scripts/gate_compare.py \ + --baseline --candidate \ + --threshold 0.01 +``` + +The threshold is a fraction of each task's score scale. Most AA tasks report +0-100, but some (e.g. `tau2_bench_telecom` `Result`) report 0-1; the gate infers +each task's scale (0-1 if both scores are within [0, 1], else 0-100) and +normalizes the drop accordingly, so `--threshold 0.01` means "≤1 pt on a 0-100 +task / ≤0.01 on a 0-1 task" uniformly. Pass `--scales '{"task": max}'` to +override inference if a task's scores happen to fall in an ambiguous range. + +Decision from `gate_compare.py`: + +- **ACCEPT** — every task within threshold → go to Step 6. +- **REGRESSION** — one or more tasks exceed threshold. **v1 stops here and + reports** which tasks regressed by how much. (Picking the next recipe and + re-running is deferred — see Scope.) +- **ANOMALOUS** — scores present but implausible (e.g. baseline lower than + candidate by a large margin, or a task score outside its valid range) → + surface to the user. + +### Step 6 — Closeout + +Report the decision with: source vs output size + ratio, per-task baseline / +candidate / delta / within-threshold, MLflow run IDs, and a publish +recommendation (publish / do-not-publish / needs-human). Archive artifacts to +the workspace. + +## Triage (gate failure → decision) + +Map a gate's `failure_class` to the next action: + +| `failure_class` | Action | +| --- | --- | +| `INFRA_TRANSIENT` | Retry the stage once; if it recurs, `SYSTEMIC`. | +| `MODEL_UNSUPPORTED` | PATCH: fix the recipe pattern / add model support (ptq skill owns the patch loop), then retry. If unpatchable, `POINT_INFEASIBLE`. | +| `QUANT_COVERAGE_FAILURE` | PATCH: fix the recipe wildcard so intended layers are covered; re-run PTQ. 
| +| `DEPLOYMENT_HEALTH_FAILED` | Drop to the **deployment** skill: reproduce serving standalone (`/health` + one generation), debug flags / image / TP / env, then carry the working command into NEL's `deployment.command` and retry the eval. If it can't serve, `POINT_INFEASIBLE`. | +| `EVAL_JUDGE_FAILED` | Usually transient (auth / rate limit) — wait and retry. | +| `SAMPLE_ACCOUNTING_FAILED` | Investigate dropped/failed samples before trusting scores. | +| `USER_CONFIG_ERROR` | Stop and ask the user. | +| `UNKNOWN` | Stop and surface to the user (`NEEDS_HUMAN`). | + +`SYSTEMIC` (cluster down, dataset unavailable) aborts the whole run. +`POINT_INFEASIBLE` means this (model, recipe) can't work as configured. + +## Output + +Return a decision, not a raw artifact: + +- `ACCEPT` + report + publish recommendation +- `REGRESSION` + which tasks failed the threshold and by how much +- `ANOMALOUS` / `INFEASIBLE` / `NEEDS_HUMAN` + reason +- Always: workspace path + MLflow run IDs for traceability + +## Scope (v1) + +In v1: the linear chain + gates + report. On `REGRESSION`, v1 reports and stops. +Deferred to a follow-up: the evaluator-optimizer recipe loop (compare → pick the +next recipe → re-run PTQ), which needs the bigpareto integration and a shared +config/result schema. diff --git a/.agents/skills/day0-release/scripts/gate_compare.py b/.agents/skills/day0-release/scripts/gate_compare.py new file mode 100644 index 00000000000..d8ae195acd1 --- /dev/null +++ b/.agents/skills/day0-release/scripts/gate_compare.py @@ -0,0 +1,208 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Day-0 compare gate. + +Decides whether a quantized candidate is within the accuracy threshold of its +baseline, per task. Pure decision logic in ``evaluate_comparison`` (unit-tested +without GPU/cluster); ``main`` reads score JSON files and prints the verdict. + +Score files are ``{task_name: score}`` dicts. Most AA task references report +``*_avg_of_N`` on a 0-100 scale, but some tasks (e.g. ``tau2_bench_telecom`` +``Result``) report on a 0-1 scale. The gate is therefore scale-aware: each +task's scale is inferred per task (0-1 if both scores are within [0, 1], else +0-100) or supplied explicitly via ``--scales``, and the drop is normalized to a +fraction of that scale so the threshold applies uniformly. The drop is an +absolute (scale-normalized) delta unless ``--relative`` is passed. +""" + +from __future__ import annotations + +import argparse +import json +import math +import sys + + +def _is_valid_score(val): + """True only for a finite real number in [_SCORE_MIN, _SCORE_MAX] (not bool).""" + return ( + isinstance(val, (int, float)) + and not isinstance(val, bool) + and math.isfinite(val) + and _SCORE_MIN <= val <= _SCORE_MAX + ) + + +# Decisions +ACCEPT = "ACCEPT" +REGRESSION = "REGRESSION" +ANOMALOUS = "ANOMALOUS" + +# Plausibility bounds. Scores may be on a 0-1 or 0-100 scale (see _infer_scale); +# the upper bound is the larger of the two so both are accepted. 
+_SCORE_MIN = 0.0 +_SCORE_MAX = 100.0 +# A candidate scoring this fraction of its scale ABOVE baseline is implausible +# for quantization (quantization should not meaningfully improve accuracy); flag +# it rather than silently passing. 0.05 = 5 pts on a 0-100 task, 0.05 on a 0-1 task. +_IMPLAUSIBLE_GAIN_FRAC = 0.05 + + +def _infer_scale(*vals): + """Infer a task's score scale: 1.0 if every score is within [0, 1], else 100.0. + + Most AA tasks report 0-100; a few (e.g. ``tau2_bench_telecom``) report 0-1. + Without scale metadata in the score files, we treat a task as 0-1 only when + every score for it fits in [0, 1] — a 0-100 task with sub-1.0 accuracy is + degenerate and caught elsewhere. Pass an explicit scale to override. + """ + return 1.0 if all(0.0 <= v <= 1.0 for v in vals) else 100.0 + + +def evaluate_comparison(baseline, candidate, threshold=0.01, relative=False, scales=None): + """Compare candidate vs baseline scores per task. + + Args: + baseline: dict ``{task: score}``. + candidate: dict ``{task: score}``. + threshold: max allowed drop, as a fraction of the task's scale + (0.01 = 1 percentage point on a 0-100 task / 0.01 on a 0-1 task, + or 1% relative if ``relative``). + relative: if True, drop is measured relative to the baseline score + (scale-invariant). + scales: optional dict ``{task: max_scale}`` to override per-task scale + inference (e.g. ``{"tau2_bench_telecom": 1.0}``). + + Returns: + dict ``{pass, decision, failure_class, detail, per_task}``. 
+ """ + scales = scales or {} + missing = sorted((set(baseline) | set(candidate)) - (set(baseline) & set(candidate))) + if missing: + return { + "pass": False, + "decision": ANOMALOUS, + "failure_class": "SAMPLE_ACCOUNTING_FAILED", + "detail": f"task sets differ; missing on one side: {missing}", + "per_task": {}, + } + if not baseline: + return { + "pass": False, + "decision": ANOMALOUS, + "failure_class": "USER_CONFIG_ERROR", + "detail": "no tasks to compare", + "per_task": {}, + } + + per_task = {} + regressed = [] + anomalies = [] + for task in sorted(baseline): + b, c = baseline[task], candidate[task] + invalid = False + for label, val in (("baseline", b), ("candidate", c)): + if not _is_valid_score(val): + anomalies.append(f"{task}: {label} score {val!r} not a finite number in [0, 100]") + invalid = True + if invalid: + # Don't compute deltas on non-numeric/out-of-range scores (would raise + # TypeError); record the anomaly and move on — the run is ANOMALOUS. + per_task[task] = { + "baseline": b, + "candidate": c, + "drop": None, + "within_threshold": False, + } + continue + scale = scales.get(task) or _infer_scale(b, c) + delta = b - c # native units, for reporting + if relative: + drop = delta / b if b else 0.0 # fraction of baseline (scale-invariant) + else: + drop = delta / scale # fraction of the task's scale + within = drop <= threshold + gain = (c - b) / scale + if gain > _IMPLAUSIBLE_GAIN_FRAC: + anomalies.append( + f"{task}: candidate exceeds baseline by {c - b:.4g} ({gain:.1%} of scale, implausible)" + ) + per_task[task] = { + "baseline": b, + "candidate": c, + "drop": round(delta, 4), + "drop_fraction": round(drop, 4), + "scale": scale, + "within_threshold": within, + } + if not within: + regressed.append(task) + + if anomalies: + return { + "pass": False, + "decision": ANOMALOUS, + "failure_class": "UNKNOWN", + "detail": "; ".join(anomalies), + "per_task": per_task, + } + if regressed: + return { + "pass": False, + "decision": REGRESSION, + 
"failure_class": None, + "detail": f"tasks exceeding threshold ({threshold}): {regressed}", + "per_task": per_task, + } + return { + "pass": True, + "decision": ACCEPT, + "failure_class": None, + "detail": f"all {len(per_task)} task(s) within threshold {threshold}", + "per_task": per_task, + } + + +def main(argv=None): + """CLI entry point: read baseline/candidate score JSON and print the verdict.""" + p = argparse.ArgumentParser(description="Day-0 compare gate") + p.add_argument("--baseline", required=True, help="baseline score JSON {task: score}") + p.add_argument("--candidate", required=True, help="candidate score JSON {task: score}") + p.add_argument("--threshold", type=float, default=0.01, help="max drop fraction (default 0.01)") + p.add_argument("--relative", action="store_true", help="measure drop relative to baseline") + p.add_argument( + "--scales", + help="optional JSON {task: max_scale} to override per-task scale inference", + ) + args = p.parse_args(argv) + + try: + with open(args.baseline) as f: + baseline = json.load(f) + with open(args.candidate) as f: + candidate = json.load(f) + scales = json.loads(args.scales) if args.scales else None + except (OSError, json.JSONDecodeError) as e: + print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)})) + return 2 + + result = evaluate_comparison(baseline, candidate, args.threshold, args.relative, scales) + print(json.dumps(result, indent=2)) + return 0 if result["pass"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.agents/skills/day0-release/scripts/gate_ptq.py b/.agents/skills/day0-release/scripts/gate_ptq.py new file mode 100644 index 00000000000..3425e775fa2 --- /dev/null +++ b/.agents/skills/day0-release/scripts/gate_ptq.py @@ -0,0 +1,192 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Day-0 post-quantization checkpoint gate. + +Mirrors the required checks in ptq/references/checkpoint-validation.md: + 1. Output smaller than source (size ratio < 1 for a compression recipe). + 2. Quantized-weight coverage matches the requested recipe (no intended layer + group left unquantized). + 3. No unexpected metadata diffs vs the source. + +Pure decision logic in ``evaluate_checkpoint`` (unit-tested without real +checkpoints); ``main`` reads a validation-summary JSON produced from the +exported checkpoint (e.g. from hf_ptq.py's quant summary + a size scan) and +prints the verdict. + +Validation summary shape: + { + "source_bytes": int, + "output_bytes": int, + "recipe": "nvfp4" | "fp8" | "nvfp4_mlp_only" | ..., + "layer_precision_counts": { + "NVFP4": int, "FP8": int, "INT4": int, + "BF16_or_excluded": int, + "unexpected_unquantized": int, + "declaration_mismatch": int + }, + "metadata_diffs": [str, ...] # unexpected diffs only; [] if clean + } +""" + +from __future__ import annotations + +import argparse +import json +import sys + +# Which precision bucket each recipe is expected to populate with a nonzero count. 
+_RECIPE_EXPECTED_PRECISION = { + "nvfp4": "NVFP4", + "nvfp4_mlp_only": "NVFP4", + "nvfp4_experts_only": "NVFP4", + "nvfp4_omlp_only": "NVFP4", + "fp8": "FP8", + "int4_awq": "INT4", +} + + +def evaluate_checkpoint(summary): + """Validate an exported quantized checkpoint summary. + + Returns dict ``{pass, failure_class, detail, checks}``. + """ + if not summary: + return { + "pass": False, + "failure_class": "USER_CONFIG_ERROR", + "detail": "empty validation summary", + "checks": {}, + } + + src = summary.get("source_bytes") + out = summary.get("output_bytes") + recipe = (summary.get("recipe") or "").lower() + counts = summary.get("layer_precision_counts") or {} + metadata_diffs = summary.get("metadata_diffs") or [] + + checks = {} + failures = [] # (failure_class, detail) + + # Check 1 — size. + if not isinstance(src, (int, float)) or not isinstance(out, (int, float)) or src <= 0: + checks["size"] = "missing/invalid source or output bytes" + failures.append(("USER_CONFIG_ERROR", "missing source/output sizes")) + else: + ratio = out / src + checks["size"] = f"{out}/{src} = {ratio:.3f}x" + if ratio >= 1.0: + failures.append( + ("QUANT_COVERAGE_FAILURE", f"output not smaller than source (ratio {ratio:.3f})") + ) + + # Check 2 — coverage. 
+ expected_bucket = _RECIPE_EXPECTED_PRECISION.get(recipe) + if expected_bucket is None: + checks["coverage"] = f"unknown recipe {recipe!r}; cannot verify coverage" + failures.append(("USER_CONFIG_ERROR", f"unknown recipe {recipe!r}")) + else: + covered = counts.get(expected_bucket, 0) + unexpected = counts.get("unexpected_unquantized", 0) + mismatch = counts.get("declaration_mismatch", 0) + checks["coverage"] = ( + f"{expected_bucket}={covered}, " + f"unexpected_unquantized={unexpected}, " + f"declaration_mismatch={mismatch}" + ) + if covered == 0: + failures.append( + ( + "MODEL_UNSUPPORTED", + f"recipe {recipe} targets {expected_bucket} but 0 layers covered " + "(wildcard likely missed the module names)", + ) + ) + if unexpected > 0: + failures.append( + ("QUANT_COVERAGE_FAILURE", f"{unexpected} layer(s) unexpectedly unquantized") + ) + if mismatch > 0: + failures.append( + ( + "QUANT_COVERAGE_FAILURE", + f"{mismatch} layer(s) with precision/declaration mismatch", + ) + ) + + # Check 3 — metadata. + checks["metadata"] = "clean" if not metadata_diffs else f"{len(metadata_diffs)} diff(s)" + if metadata_diffs: + failures.append(("QUANT_COVERAGE_FAILURE", f"unexpected metadata diffs: {metadata_diffs}")) + + if not failures: + return { + "pass": True, + "failure_class": None, + "detail": "size, coverage, and metadata all pass", + "checks": checks, + } + + # Surface the most actionable failure_class first: MODEL_UNSUPPORTED > + # QUANT_COVERAGE_FAILURE > USER_CONFIG_ERROR. 
+ order = ["MODEL_UNSUPPORTED", "QUANT_COVERAGE_FAILURE", "USER_CONFIG_ERROR"] + failures.sort(key=lambda f: order.index(f[0]) if f[0] in order else len(order)) + return { + "pass": False, + "failure_class": failures[0][0], + "detail": "; ".join(d for _, d in failures), + "checks": checks, + } + + +def main(argv=None): + """CLI entry point: read a validation-summary JSON and print the verdict.""" + p = argparse.ArgumentParser(description="Day-0 post-quantization checkpoint gate") + p.add_argument("--summary", help="validation-summary JSON (see module docstring)") + p.add_argument("--checkpoint", help="(reserved) checkpoint dir; v1 expects --summary") + p.add_argument("--source", help="(reserved) source model id/path") + p.add_argument("--recipe", help="(reserved) qformat; overrides summary.recipe if given") + args = p.parse_args(argv) + + if not args.summary: + print( + json.dumps( + { + "pass": False, + "failure_class": "USER_CONFIG_ERROR", + "detail": "v1 requires --summary ; " + "produce it from the exported checkpoint (size scan + hf_ptq quant summary)", + } + ) + ) + return 2 + + try: + with open(args.summary) as f: + summary = json.load(f) + except (OSError, json.JSONDecodeError) as e: + print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)})) + return 2 + + if args.recipe: + summary["recipe"] = args.recipe + + result = evaluate_checkpoint(summary) + print(json.dumps(result, indent=2)) + return 0 if result["pass"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.agents/skills/day0-release/scripts/gate_run.py b/.agents/skills/day0-release/scripts/gate_run.py new file mode 100644 index 00000000000..d5dcbe94a70 --- /dev/null +++ b/.agents/skills/day0-release/scripts/gate_run.py @@ -0,0 +1,159 @@ +# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Day-0 evaluation-run gate. + +Validates that a completed evaluation run is trustworthy before its scores are +compared. Mirrors the checks in evaluation/references/run-validation.md. Pure +decision logic in ``evaluate_run`` (unit-tested without a cluster); ``main`` +reads a run-summary JSON and prints the verdict. + +The run summary is a dict with, per task: + { + "tasks": { + "": { + "status": "SUCCESS" | "FAILED" | "RUNNING" | "PENDING" | "TIMEOUT" | "RESUMING", + "expected_samples": int, + "scored_samples": int, + "score": float | null, # canonical score, if extracted + "errors": [str, ...] # judge/parse/sample errors, if any + } + } + } +Only a terminal SUCCESS with complete, numeric scores passes. Non-terminal +statuses (RUNNING/PENDING/TIMEOUT/RESUMING) do NOT pass — the run hasn't +finished — but they classify as INFRA_TRANSIENT (wait for NEL to resume/finish; +not a real regression), distinct from a terminal FAILED. +""" + +from __future__ import annotations + +import argparse +import json +import math +import sys + +_TERMINAL_OK = "SUCCESS" +# Not done yet — NEL resumes/finishes these; transient, not a real failure. +_NON_TERMINAL = {"TIMEOUT", "RESUMING", "RUNNING", "PENDING"} + + +def evaluate_run(summary): + """Validate a completed run summary. + + Returns dict ``{pass, failure_class, detail, per_task}``. 
+ """ + tasks = (summary or {}).get("tasks") + if not tasks: + return { + "pass": False, + "failure_class": "USER_CONFIG_ERROR", + "detail": "run summary has no tasks", + "per_task": {}, + } + + per_task = {} + problems = [] + for name, t in sorted(tasks.items()): + status = t.get("status") + expected = t.get("expected_samples") + scored = t.get("scored_samples") + score = t.get("score") + errors = t.get("errors") or [] + + ok = True + reasons = [] + + if status in _NON_TERMINAL: + ok = False + reasons.append(f"status {status}: not terminal yet (resume/finish expected)") + elif status != _TERMINAL_OK: + ok = False + reasons.append(f"status {status!r} is not SUCCESS") + + if errors: + ok = False + # Classify the first error to a failure_class hint. + joined = " ".join(errors).lower() + if any(k in joined for k in ("judge", "rate limit", "unauthorized", "auth")): + reasons.append(f"judge/auth error: {errors[0]}") + else: + reasons.append(f"error: {errors[0]}") + + if expected is not None and scored is not None and scored != expected: + ok = False + reasons.append(f"sample accounting: scored {scored} of {expected}") + + if score is None: + ok = False + reasons.append("no score extracted") + elif not ( + isinstance(score, (int, float)) and not isinstance(score, bool) and math.isfinite(score) + ): + ok = False + reasons.append(f"score not numeric/finite: {score!r}") + + per_task[name] = {"ok": ok, "reasons": reasons} + if not ok: + problems.append((name, reasons)) + + if not problems: + return { + "pass": True, + "failure_class": None, + "detail": f"all {len(per_task)} task(s) valid", + "per_task": per_task, + } + + # Pick the dominant failure_class for the run. + flat = " ".join(r for _, rs in problems for r in rs).lower() + if any(k in flat for k in ("judge", "rate limit", "unauthorized", "auth")): + fc = "EVAL_JUDGE_FAILED" + elif "not terminal" in flat: + # Non-terminal (RUNNING/PENDING/TIMEOUT/RESUMING): wait for resume/finish. 
+ fc = "INFRA_TRANSIENT" + elif "sample accounting" in flat or "no score" in flat or "score not numeric" in flat: + fc = "SAMPLE_ACCOUNTING_FAILED" + else: + fc = "UNKNOWN" + + return { + "pass": False, + "failure_class": fc, + "detail": "; ".join(f"{n}: {', '.join(rs)}" for n, rs in problems), + "per_task": per_task, + } + + +def main(argv=None): + """CLI entry point: read a run-summary JSON and print the verdict.""" + p = argparse.ArgumentParser(description="Day-0 evaluation-run gate") + p.add_argument("--run", required=True, help="run-summary JSON (see module docstring)") + args = p.parse_args(argv) + + try: + with open(args.run) as f: + summary = json.load(f) + except (OSError, json.JSONDecodeError) as e: + print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)})) + return 2 + + result = evaluate_run(summary) + print(json.dumps(result, indent=2)) + return 0 if result["pass"] else 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/.agents/skills/day0-release/tests/evals.json b/.agents/skills/day0-release/tests/evals.json new file mode 100644 index 00000000000..bd78b22ce11 --- /dev/null +++ b/.agents/skills/day0-release/tests/evals.json @@ -0,0 +1,71 @@ +[ + { + "name": "full-day0-release-triggers", + "skills": [ + "day0-release" + ], + "query": "Release org/PLACEHOLDER-MODEL at day-0: quantize to NVFP4, validate it's within 1% of the BF16 baseline on the AA suite, and tell me if it's publishable. 
Run on my cluster.", + "files": [], + "expected_behavior": [ + "Selects the day0-release skill for the full goal-driven release", + "Resolves model, recipe/qformat, cluster, eval set, and threshold before starting", + "Runs stages in fixed order: setup, PTQ, baseline eval, quantized eval, compare, closeout", + "Runs the gate after each stage and does not advance past a failed gate", + "Returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE), not just raw scores" + ] + }, + { + "name": "quantize-only-does-not-trigger-day0", + "skills": [ + "ptq" + ], + "query": "Quantize org/PLACEHOLDER-MODEL to NVFP4 and save the checkpoint.", + "files": [], + "expected_behavior": [ + "Selects the ptq skill, not day0-release", + "Does not start deployment, evaluation, or comparison", + "Treats this as a single-stage quantization request" + ] + }, + { + "name": "evaluate-only-does-not-trigger-day0", + "skills": [ + "evaluation" + ], + "query": "Run MMLU-Pro on the endpoint I already have serving at http://host:8000.", + "files": [], + "expected_behavior": [ + "Selects the evaluation skill, not day0-release", + "Does not quantize or run a baseline-vs-candidate comparison" + ] + }, + { + "name": "ptq-gate-blocks-on-coverage-miss", + "skills": [ + "day0-release" + ], + "query": "Run the day-0 workflow for org/PLACEHOLDER-MODEL with nvfp4, but the PTQ checkpoint came out the same size as the source.", + "files": [], + "expected_behavior": [ + "Runs gate_ptq.py on the exported checkpoint before evaluating", + "Treats a size ratio >= 1.0 or zero quantized-layer coverage as a gate failure", + "Does not evaluate a checkpoint that failed the PTQ gate", + "Branches on the failure_class (e.g. 
MODEL_UNSUPPORTED or QUANT_COVERAGE_FAILURE) rather than silently continuing"
+    ]
+  },
+  {
+    "name": "regression-reports-and-stops-in-v1",
+    "skills": [
+      "day0-release",
+      "compare-results"
+    ],
+    "query": "Day-0 release for org/PLACEHOLDER-MODEL nvfp4; the quantized GPQA score is 3 points below baseline.",
+    "files": [],
+    "expected_behavior": [
+      "Runs gate_compare.py with the accuracy threshold",
+      "Classifies a beyond-threshold drop as REGRESSION",
+      "Reports which tasks regressed and by how much",
+      "Does NOT auto-select a new recipe and re-run in v1 (that loop is deferred)"
+    ]
+  }
+]
diff --git a/.agents/skills/day0-release/tests/test_gates.py b/.agents/skills/day0-release/tests/test_gates.py
new file mode 100644
index 00000000000..e4b39c7b7eb
--- /dev/null
+++ b/.agents/skills/day0-release/tests/test_gates.py
@@ -0,0 +1,235 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the day-0 gate scripts.
+
+These are deterministic — no GPU, cluster, or network. They test the pure
+decision functions that the gates rest on. Run with:
+
+    python -m pytest .agents/skills/day0-release/tests/test_gates.py
+"""
+
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "scripts"))
+
+from gate_compare import evaluate_comparison
+from gate_ptq import evaluate_checkpoint
+from gate_run import evaluate_run
+
+# ── gate_compare ──────────────────────────────────────────────────────
+
+
+def test_compare_accept_within_threshold():
+    r = evaluate_comparison(
+        {"gpqa": 50.0, "scicode": 30.0}, {"gpqa": 49.5, "scicode": 29.8}, threshold=0.01
+    )
+    assert r["pass"] and r["decision"] == "ACCEPT"
+
+
+def test_compare_regression_exceeds_threshold():
+    r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": 47.5}, threshold=0.01)  # 2.5 pt drop
+    assert not r["pass"] and r["decision"] == "REGRESSION"
+    assert "gpqa" in r["detail"]
+
+
+def test_compare_anomalous_implausible_gain():
+    r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": 60.0}, threshold=0.01)  # +10 pts
+    assert not r["pass"] and r["decision"] == "ANOMALOUS"
+
+
+def test_compare_anomalous_out_of_range():
+    r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": 150.0}, threshold=0.01)
+    assert r["decision"] == "ANOMALOUS"
+
+
+def test_compare_mismatched_task_sets():
+    r = evaluate_comparison({"gpqa": 50.0}, {"scicode": 30.0}, threshold=0.01)
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_compare_relative_threshold():
+    # 1% relative of 50 = 0.5 pts; a 0.4 pt drop passes, 0.6 fails.
+    assert evaluate_comparison({"t": 50.0}, {"t": 49.6}, threshold=0.01, relative=True)["pass"]
+    assert not evaluate_comparison({"t": 50.0}, {"t": 49.4}, threshold=0.01, relative=True)["pass"]
+
+
+def test_compare_0_to_1_scale_full_collapse_is_regression():
+    # tau2_bench_telecom reports Result on a 0-1 scale. A full collapse
+    # (1.0 -> 0.0) must REGRESS, not pass via the old 0-100 limit assumption.
+    r = evaluate_comparison(
+        {"tau2_bench_telecom": 1.0}, {"tau2_bench_telecom": 0.0}, threshold=0.01
+    )
+    assert not r["pass"] and r["decision"] == "REGRESSION"
+    assert "tau2_bench_telecom" in r["detail"]
+
+
+def test_compare_0_to_1_scale_within_threshold_accepts():
+    # A 0.005 drop on a 0-1 task is within the 0.01 threshold.
+    r = evaluate_comparison({"t": 0.900}, {"t": 0.895}, threshold=0.01)
+    assert r["pass"] and r["decision"] == "ACCEPT"
+
+
+def test_compare_explicit_scale_override():
+    # Force a 0-100 scale even though both scores fit in [0, 1]: a 0.5 -> 0.4
+    # drop is 0.1 pts on a 0-100 scale, well within threshold.
+    r = evaluate_comparison({"t": 0.5}, {"t": 0.4}, threshold=0.01, scales={"t": 100.0})
+    assert r["pass"] and r["decision"] == "ACCEPT"
+
+
+def test_compare_mixed_scales_in_one_suite():
+    # 0-100 task within threshold + 0-1 task collapsing -> overall REGRESSION.
+    r = evaluate_comparison(
+        {"gpqa": 50.0, "tau2_bench_telecom": 1.0},
+        {"gpqa": 49.8, "tau2_bench_telecom": 0.0},
+        threshold=0.01,
+    )
+    assert not r["pass"] and r["decision"] == "REGRESSION"
+    assert "tau2_bench_telecom" in r["detail"] and "gpqa" not in r["detail"]
+
+
+# ── gate_run ──────────────────────────────────────────────────────────
+
+
+def _task(**kw):
+    base = {
+        "status": "SUCCESS",
+        "expected_samples": 100,
+        "scored_samples": 100,
+        "score": 42.0,
+        "errors": [],
+    }
+    base.update(kw)
+    return base
+
+
+def test_run_all_valid():
+    r = evaluate_run({"tasks": {"gpqa": _task(), "scicode": _task()}})
+    assert r["pass"]
+
+
+def test_run_dropped_samples():
+    r = evaluate_run({"tasks": {"gpqa": _task(scored_samples=90)}})
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_run_judge_error():
+    r = evaluate_run({"tasks": {"gpqa": _task(errors=["judge rate limit exceeded"])}})
+    assert not r["pass"] and r["failure_class"] == "EVAL_JUDGE_FAILED"
+
+
+def test_run_missing_score():
+    r = evaluate_run({"tasks": {"gpqa": _task(score=None)}})
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_run_timeout_is_not_terminal():
+    r = evaluate_run({"tasks": {"gpqa": _task(status="TIMEOUT")}})
+    assert not r["pass"] and r["failure_class"] == "INFRA_TRANSIENT"
+
+
+def test_run_no_tasks():
+    r = evaluate_run({"tasks": {}})
+    assert not r["pass"] and r["failure_class"] == "USER_CONFIG_ERROR"
+
+
+# ── gate_ptq ──────────────────────────────────────────────────────────
+
+
+def _ckpt(**kw):
+    base = {
+        "source_bytes": 16_000_000_000,
+        "output_bytes": 8_000_000_000,
+        "recipe": "nvfp4",
+        "layer_precision_counts": {
+            "NVFP4": 224,
+            "BF16_or_excluded": 3,
+            "unexpected_unquantized": 0,
+            "declaration_mismatch": 0,
+        },
+        "metadata_diffs": [],
+    }
+    base.update(kw)
+    return base
+
+
+def test_ptq_pass():
+    assert evaluate_checkpoint(_ckpt())["pass"]
+
+
+def test_ptq_not_smaller():
+    r = evaluate_checkpoint(_ckpt(output_bytes=16_000_000_000))
+    assert not r["pass"] and r["failure_class"] == "QUANT_COVERAGE_FAILURE"
+
+
+def test_ptq_zero_coverage_is_model_unsupported():
+    r = evaluate_checkpoint(
+        _ckpt(
+            layer_precision_counts={
+                "NVFP4": 0,
+                "unexpected_unquantized": 0,
+                "declaration_mismatch": 0,
+            }
+        )
+    )
+    assert not r["pass"] and r["failure_class"] == "MODEL_UNSUPPORTED"
+
+
+def test_ptq_unexpected_unquantized():
+    r = evaluate_checkpoint(
+        _ckpt(
+            layer_precision_counts={
+                "NVFP4": 200,
+                "unexpected_unquantized": 24,
+                "declaration_mismatch": 0,
+            }
+        )
+    )
+    assert not r["pass"] and r["failure_class"] == "QUANT_COVERAGE_FAILURE"
+
+
+def test_ptq_metadata_diff():
+    r = evaluate_checkpoint(_ckpt(metadata_diffs=["chat_template changed"]))
+    assert not r["pass"] and r["failure_class"] == "QUANT_COVERAGE_FAILURE"
+
+
+def test_ptq_unknown_recipe():
+    r = evaluate_checkpoint(_ckpt(recipe="mystery"))
+    assert not r["pass"] and r["failure_class"] == "USER_CONFIG_ERROR"
+
+
+# ── regression tests for malformed inputs ────────────────────────────
+
+
+def test_compare_non_numeric_score_is_anomalous_not_crash():
+    # A string/None score must not raise TypeError; it's ANOMALOUS.
+    for bad in ("42", None, float("nan"), True):
+        r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": bad}, threshold=0.01)
+        assert not r["pass"] and r["decision"] == "ANOMALOUS", bad
+
+
+def test_run_non_numeric_score_fails():
+    r = evaluate_run({"tasks": {"gpqa": _task(score="42")}})
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_run_running_is_infra_transient():
+    r = evaluate_run({"tasks": {"gpqa": _task(status="RUNNING", score=None)}})
+    assert not r["pass"] and r["failure_class"] == "INFRA_TRANSIENT"
+
+
+if __name__ == "__main__":
+    sys.exit(__import__("pytest").main([__file__, "-q"]))
diff --git a/.github/workflows/unit_tests.yml b/.github/workflows/unit_tests.yml
index fc2ae364cba..66f2486bcec 100644
--- a/.github/workflows/unit_tests.yml
+++ b/.github/workflows/unit_tests.yml
@@ -13,6 +13,7 @@ on:
       - "pyproject.toml"
       - "tests/unit/**"
       - "tools/launcher/**"
+      - ".agents/skills/**"
   schedule:
     - cron: "0 0 * * *" # Nightly
   workflow_dispatch:
@@ -55,6 +56,7 @@ jobs:
             pyproject.toml
             tests/unit/**
             tools/launcher/**
+            .agents/skills/**
   linux:
     needs: [check-dco]
     runs-on: ubuntu-latest
@@ -142,10 +144,27 @@ jobs:
           uv venv .venv
           uv pip install -e . pytest
           uv run python3 -m pytest -v
+  skills:
+    if: needs.check-file-changes.outputs.any_changed == 'true'
+    needs: [linux, check-file-changes]
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    steps:
+      - uses: actions/checkout@v6
+      - uses: actions/setup-python@v6
+        with:
+          python-version: "3.12"
+      - name: Run skill gate tests
+        # Skill gate tests are stdlib-only and hermetic (no GPU/cluster/network),
+        # so they run in their own lightweight job rather than the main unit lane.
+        # Override addopts to drop the repo's coverage/instafail plugins (not installed here).
+        run: |
+          pip install pytest
+          python -m pytest .agents/skills/ -o addopts="" -p no:cacheprovider -v
   unit-pr-required-check:
     # Run even if some jobs are skipped
     if: ${{ github.event_name == 'pull_request' && always() }}
-    needs: [check-file-changes, linux, windows, multi-version, partial-install, launcher]
+    needs: [check-file-changes, linux, windows, multi-version, partial-install, launcher, skills]
     runs-on: ubuntu-latest
     steps:
       - name: Required unit tests did not succeed
@@ -154,6 +173,7 @@ jobs:
             needs.windows.result != 'success' ||
             needs.multi-version.result != 'success' ||
             needs.partial-install.result != 'success' ||
-            needs.launcher.result != 'success'
+            needs.launcher.result != 'success' ||
+            needs.skills.result != 'success'
           )) }}
         run: exit 1
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index da02b315f67..5ec17822d64 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -80,6 +80,7 @@ Changelog
 - Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md `_ for more details.
 - [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
 - [Early Testing] Polish Claude Code evaluation skill (``.claude/skills/evaluation/``) for agent-assisted LLM accuracy benchmarking via NeMo Evaluator Launcher. Adds two companion skills vendored verbatim from `NVIDIA-NeMo/Evaluator `_: ``launching-evals`` (run/check/debug/analyze NEL evaluations) and ``accessing-mlflow`` (query MLflow runs, compare metrics, fetch artifacts). Re-sync at a pinned upstream SHA via ``.claude/scripts/sync-upstream-skills.sh``. Also adds a shared ``skills/common/credentials.md`` covering HF / NGC / Docker token setup referenced by multiple skills. This feature is in early testing — use with caution.
+- [Early Testing] Add Claude Code day0-release skill (``.claude/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and baseline-vs-candidate accuracy threshold. v1 reports and stops on regression; the recipe-search loop is deferred. This feature is in early testing — use with caution.
 - Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml `_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml `_ for usage.
 - Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
 - Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter `, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py `_ for the general timm-model quantize→ONNX workflow.
diff --git a/pyproject.toml b/pyproject.toml
index 74927947215..28ca051d9e7 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -210,6 +210,7 @@ extend-ignore = [
 "examples/*" = ["D"]
 "noxfile.py" = ["D", "E501"]
 "tests/*" = ["B017", "D", "E402", "PT012"]
+".agents/skills/*/tests/test_*.py" = ["D", "E402"] # Skill test scripts: docstring (D) + sys.path import-order (E402) exemptions
 "*/_[a-zA-Z]*" = ["D"] # Private packages (_abc/*.py) or modules (_xyz.py)
 "*.ipynb" = [
     "D",