NVIDIA · Edwardf0t1 · Jun 2, 2026 · Jun 4, 2026
diff --git a/.agents/skills/day0-release/SKILL.md b/.agents/skills/day0-release/SKILL.md
@@ -0,0 +1,177 @@
+---
+name: day0-release
+description: Deterministic end-to-end driver for day-0 quantized-checkpoint releases — chains PTQ → evaluation → comparison with enforced gates between stages (the evaluation stage deploys the checkpoint itself), and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Use when the user asks to "release a model at day-0", "quantize and validate model X is within N% of baseline and tell me if it's publishable", or "run the full day-0 workflow". Do NOT use for single-stage requests — quantizing only (use ptq), serving only (use deployment), evaluating only (use evaluation), or comparing two existing runs (use compare-results).
+license: Apache-2.0
+---
+
+# Day-0 Release
+
+Drive a model from a pretrained checkpoint to a publish decision for a quantized
+checkpoint, in a fixed sequence with a gate after every stage. This skill is a
+**conductor**: it sequences the existing domain skills and enforces the gates —
+it does not re-implement quantization, serving, evaluation, or comparison.
+
+**Goal (the default day-0 criterion):** a quantized checkpoint smaller than the
+source, with accuracy drop within the threshold (default ≤1%) on the standard
+benchmark set versus the matching baseline, plus a publish recommendation.
+
+## When to use
+
+Use only for the full goal-driven release. For a single stage, route to the
+domain skill directly: quantize → **ptq**, serve → **deployment**, evaluate →
+**evaluation**, compare two existing runs → **compare-results**.
+
+## Inputs
+
+Resolve these before starting (ask the user for anything missing):
+
+- **Model** — HF handle or checkpoint path.
+- **Recipe / qformat** — e.g. `nvfp4`, `fp8`, or a recipe path. One candidate for v1.
+- **Cluster / launcher** — from `clusters.yaml` (see `skills/common/environment-setup.md`).
+- **Eval set** — defaults to the AA suite (`evaluation/recipes/tasks/aa/`).
+- **Threshold** — max accuracy drop; default `0.01` (1%).
+
+## The chain
+
+```text
+setup ─▶ PTQ ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
+          │          │                │               │
+       gate_ptq   gate_run         gate_run       gate_compare
+```
+
+The **evaluation** skill deploys the model it evaluates (it stands up its own
+endpoint per run), so there is no separate deploy stage — a serving failure
+surfaces through the eval stage's gate (`DEPLOYMENT_HEALTH_FAILED`) and triages
+to the **deployment** skill to debug serving in isolation (see Step 4).
+
+Run each stage by invoking the domain skill, then run its gate before
+proceeding. **Do not advance past a failed gate.** Copy this checklist and track
+progress:
+
+```text
+- [ ] Step 0: Resolve inputs; confirm threshold and eval set
+- [ ] Step 1: Setup gate — creds present, cluster reachable
+- [ ] Step 2: PTQ (ptq skill) → gate_ptq.py
+- [ ] Step 3: Baseline eval (evaluation skill, deploys source) → gate_run.py   [skip if cached, see below]
+- [ ] Step 4: Quantized eval (evaluation skill, deploys candidate) → gate_run.py
+- [ ] Step 5: Compare (compare-results skill) → gate_compare.py → decision
+- [ ] Step 6: Closeout — report + publish recommendation
+```
+
+### Step 1 — Setup gate
+
+Confirm credentials (`skills/common/credentials.md`) and cluster reachability
+(`skills/common/remote-execution.md`). If either fails, stop with
+`SYSTEMIC` — do not start PTQ.
+
+### Step 2 — PTQ
+
+Invoke the **ptq** skill to produce the quantized checkpoint. Then gate:
+
+```bash
+# The ptq skill's post-PTQ validation produces a validation-summary JSON (size
+# ratio + layer-precision counts + metadata diffs; see
+# ptq/references/checkpoint-validation.md). v1 gates on that summary:
+python .agents/skills/day0-release/scripts/gate_ptq.py --summary <validation-summary.json>
+#   add `--recipe <qformat>` to override the recipe recorded in the summary
+```
+
+`gate_ptq.py` returns JSON `{pass, failure_class, detail}`. On `pass: false`,
+branch on `failure_class` (see **Triage** below). Do not evaluate an
+unvalidated checkpoint.
+
+### Step 3 — Baseline eval
+
+The baseline is the **source** (pre-quantization) model on the same task set and
+sampling params. **Look it up first** — if a matching baseline run already
+exists in MLflow (same model, task set, sampling params), reuse it and skip this
+stage. Otherwise run it via the **evaluation** skill (which deploys the source
+model itself). Gate with `gate_run.py`.
+
+### Step 4 — Quantized eval
+
+Invoke the **evaluation** skill on the quantized checkpoint, matching the
+baseline's task set and sampling params. The evaluation skill stands up the
+serving endpoint itself (it builds the `deployment.command`, e.g. a
+`vllm serve …`), so a serving failure surfaces here as a failed `gate_run.py`
+with `DEPLOYMENT_HEALTH_FAILED`. When that happens, **drop to the deployment
+skill** to reproduce and debug serving in isolation (serve the checkpoint
+standalone, confirm `/health` + one generation, iterate on flags / TP / image /
+env vars) rather than burning full eval cycles on a broken endpoint — then carry
+the working command back into NEL's `deployment.command` and resume the eval. If
+the checkpoint genuinely can't serve, `POINT_INFEASIBLE`. Gate:
+
+```bash
+python .agents/skills/day0-release/scripts/gate_run.py --run <run-summary.json>
+```
+
+A `pass: false` here means the run is incomplete or invalid (judge/parse error,
+dropped samples) — do **not** compare scores from it.
+
+### Step 5 — Compare
+
+Invoke the **compare-results** skill to produce per-task deltas, then gate:
+
+```bash
+python .agents/skills/day0-release/scripts/gate_compare.py \
+    --baseline <baseline_scores.json> --candidate <candidate_scores.json> \
+    --threshold 0.01
+```
+
+The threshold is a fraction of each task's score scale. Most AA tasks report
+0-100, but some (e.g. `tau2_bench_telecom` `Result`) report 0-1; the gate infers
+each task's scale (0-1 if both scores are within [0, 1], else 0-100) and
+normalizes the drop accordingly, so `--threshold 0.01` means "≤1 pt on a 0-100
+task / ≤0.01 on a 0-1 task" uniformly. Pass `--scales '{"task": max}'` to
+override inference if a task's scores happen to fall in an ambiguous range.
+
+Decision from `gate_compare.py`:
+
+- **ACCEPT** — every task within threshold → go to Step 6.
+- **REGRESSION** — one or more tasks exceed threshold. **v1 stops here and
+  reports** which tasks regressed by how much. (Picking the next recipe and
+  re-running is deferred — see Scope.)
+- **ANOMALOUS** — scores present but implausible (e.g. candidate above baseline
+  by more than 5% of the task's scale, or a score outside [0, 100]) →
+  surface to the user.
+
+### Step 6 — Closeout
+
+Report the decision with: source vs output size + ratio, per-task baseline /
+candidate / delta / within-threshold, MLflow run IDs, and a publish
+recommendation (publish / do-not-publish / needs-human). Archive artifacts to
+the workspace.
+
+## Triage (gate failure → decision)
+
+Map a gate's `failure_class` to the next action:
+
+| `failure_class` | Action |
+| --- | --- |
+| `INFRA_TRANSIENT` | Retry the stage once; if it recurs, `SYSTEMIC`. |
+| `MODEL_UNSUPPORTED` | PATCH: fix the recipe pattern / add model support (ptq skill owns the patch loop), then retry. If unpatchable, `POINT_INFEASIBLE`. |
+| `QUANT_COVERAGE_FAILURE` | PATCH: fix the recipe wildcard so intended layers are covered; re-run PTQ. |
+| `DEPLOYMENT_HEALTH_FAILED` | Drop to the **deployment** skill: reproduce serving standalone (`/health` + one generation), debug flags / image / TP / env, then carry the working command into NEL's `deployment.command` and retry the eval. If it can't serve, `POINT_INFEASIBLE`. |
+| `EVAL_JUDGE_FAILED` | Usually transient (auth / rate limit) — wait and retry. |
+| `SAMPLE_ACCOUNTING_FAILED` | Investigate dropped/failed samples before trusting scores. |
+| `USER_CONFIG_ERROR` | Stop and ask the user. |
+| `UNKNOWN` | Stop and surface to the user (`NEEDS_HUMAN`). |
+
+`SYSTEMIC` (cluster down, dataset unavailable) aborts the whole run.
+`POINT_INFEASIBLE` means this (model, recipe) can't work as configured.
+
+## Output
+
+Return a decision, not a raw artifact:
+
+- `ACCEPT` + report + publish recommendation
+- `REGRESSION` + which tasks failed the threshold and by how much
+- `ANOMALOUS` / `INFEASIBLE` / `NEEDS_HUMAN` + reason
+- Always: workspace path + MLflow run IDs for traceability
+
+## Scope (v1)
+
+In v1: the linear chain + gates + report. On `REGRESSION`, v1 reports and stops.
+Deferred to a follow-up: the evaluator-optimizer recipe loop (compare → pick the
+next recipe → re-run PTQ), which needs the bigpareto integration and a shared
+config/result schema.
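
Review note on the Triage table in SKILL.md: the mapping from a gate's JSON result to the next action can be written as a small lookup. The sketch below is illustrative only. `TRIAGE`, `next_action`, and the action labels (`RETRY_ONCE`, `PATCH`, and so on) are hypothetical names that are not part of this PR; only the `failure_class` values and the `pass` / `decision` keys come from the diff.

```python
# Hypothetical sketch of the Triage table as a lookup. The action labels are
# made up for illustration; the failure_class keys are the ones in SKILL.md.
TRIAGE = {
    "INFRA_TRANSIENT": "RETRY_ONCE",
    "MODEL_UNSUPPORTED": "PATCH",
    "QUANT_COVERAGE_FAILURE": "PATCH",
    "DEPLOYMENT_HEALTH_FAILED": "DEBUG_DEPLOYMENT",
    "EVAL_JUDGE_FAILED": "WAIT_AND_RETRY",
    "SAMPLE_ACCOUNTING_FAILED": "INVESTIGATE",
    "USER_CONFIG_ERROR": "ASK_USER",
    "UNKNOWN": "NEEDS_HUMAN",
}


def next_action(gate_result, already_retried=False):
    """Map one gate result dict to the next action (assumed labels)."""
    if gate_result.get("pass"):
        return "ADVANCE"
    # gate_compare.py reports REGRESSION with failure_class None; v1 stops there.
    if gate_result.get("decision") == "REGRESSION":
        return "REPORT_AND_STOP"
    failure_class = gate_result.get("failure_class")
    # A transient failure that recurs after one retry escalates to SYSTEMIC.
    if failure_class == "INFRA_TRANSIENT" and already_retried:
        return "SYSTEMIC"
    # Anything unrecognized is treated like UNKNOWN and surfaced to the user.
    return TRIAGE.get(failure_class, "NEEDS_HUMAN")


print(next_action({"pass": False, "failure_class": "DEPLOYMENT_HEALTH_FAILED"}))
```

Keeping the table as data rather than branching logic would make it easy to diff against SKILL.md when a new `failure_class` is added.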
diff --git a/.agents/skills/day0-release/scripts/gate_compare.py b/.agents/skills/day0-release/scripts/gate_compare.py
@@ -0,0 +1,208 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Day-0 compare gate.
+
+Decides whether a quantized candidate is within the accuracy threshold of its
+baseline, per task. Pure decision logic in ``evaluate_comparison`` (unit-tested
+without GPU/cluster); ``main`` reads score JSON files and prints the verdict.
+
+Score files are ``{task_name: score}`` dicts. Most AA task references report
+``*_avg_of_N`` on a 0-100 scale, but some tasks (e.g. ``tau2_bench_telecom``
+``Result``) report on a 0-1 scale. The gate is therefore scale-aware: each
+task's scale is inferred per task (0-1 if both scores are within [0, 1], else
+0-100) or supplied explicitly via ``--scales``, and the drop is normalized to a
+fraction of that scale so the threshold applies uniformly. The drop is an
+absolute (scale-normalized) delta unless ``--relative`` is passed.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import sys
+
+
+def _is_valid_score(val):
+    """True only for a finite real number in [_SCORE_MIN, _SCORE_MAX] (not bool)."""
+    return (
+        isinstance(val, (int, float))
+        and not isinstance(val, bool)
+        and math.isfinite(val)
+        and _SCORE_MIN <= val <= _SCORE_MAX
+    )
+
+
+# Decisions
+ACCEPT = "ACCEPT"
+REGRESSION = "REGRESSION"
+ANOMALOUS = "ANOMALOUS"
+
+# Plausibility bounds. Scores may be on a 0-1 or 0-100 scale (see _infer_scale);
+# the upper bound is the larger of the two so both are accepted.
+_SCORE_MIN = 0.0
+_SCORE_MAX = 100.0
+# A candidate scoring this fraction of its scale ABOVE baseline is implausible
+# for quantization (quantization should not meaningfully improve accuracy); flag
+# it rather than silently passing. 0.05 = 5 pts on a 0-100 task, 0.05 on a 0-1 task.
+_IMPLAUSIBLE_GAIN_FRAC = 0.05
+
+
+def _infer_scale(*vals):
+    """Infer a task's score scale: 1.0 if every score is within [0, 1], else 100.0.
+
+    Most AA tasks report 0-100; a few (e.g. ``tau2_bench_telecom``) report 0-1.
+    Without scale metadata in the score files, we treat a task as 0-1 only when
+    every score for it fits in [0, 1] — a 0-100 task with sub-1.0 accuracy is
+    degenerate and caught elsewhere. Pass an explicit scale to override.
+    """
+    return 1.0 if all(0.0 <= v <= 1.0 for v in vals) else 100.0
+
+
+def evaluate_comparison(baseline, candidate, threshold=0.01, relative=False, scales=None):
+    """Compare candidate vs baseline scores per task.
+
+    Args:
+        baseline: dict ``{task: score}``.
+        candidate: dict ``{task: score}``.
+        threshold: max allowed drop, as a fraction of the task's scale
+            (0.01 = 1 percentage point on a 0-100 task / 0.01 on a 0-1 task,
+            or 1% relative if ``relative``).
+        relative: if True, drop is measured relative to the baseline score
+            (scale-invariant).
+        scales: optional dict ``{task: max_scale}`` to override per-task scale
+            inference (e.g. ``{"tau2_bench_telecom": 1.0}``).
+
+    Returns:
+        dict ``{pass, decision, failure_class, detail, per_task}``.
+    """
+    scales = scales or {}
+    missing = sorted(set(baseline) ^ set(candidate))  # tasks present on only one side
+    if missing:
+        return {
+            "pass": False,
+            "decision": ANOMALOUS,
+            "failure_class": "SAMPLE_ACCOUNTING_FAILED",
+            "detail": f"task sets differ; missing on one side: {missing}",
+            "per_task": {},
+        }
+    if not baseline:
+        return {
+            "pass": False,
+            "decision": ANOMALOUS,
+            "failure_class": "USER_CONFIG_ERROR",
+            "detail": "no tasks to compare",
+            "per_task": {},
+        }
+
+    per_task = {}
+    regressed = []
+    anomalies = []
+    for task in sorted(baseline):
+        b, c = baseline[task], candidate[task]
+        invalid = False
+        for label, val in (("baseline", b), ("candidate", c)):
+            if not _is_valid_score(val):
+                anomalies.append(f"{task}: {label} score {val!r} not a finite number in [0, 100]")
+                invalid = True
+        if invalid:
+            # Don't compute deltas on invalid scores (a non-numeric value would
+            # raise TypeError); record the anomaly and move on. The run is ANOMALOUS.
+            per_task[task] = {
+                "baseline": b,
+                "candidate": c,
+                "drop": None,
+                "within_threshold": False,
+            }
+            continue
+        scale = scales.get(task) or _infer_scale(b, c)
+        delta = b - c  # native units, for reporting
+        if relative:
+            drop = delta / b if b else 0.0  # fraction of baseline (scale-invariant)
+        else:
+            drop = delta / scale  # fraction of the task's scale
+        within = drop <= threshold
+        gain = (c - b) / scale
+        if gain > _IMPLAUSIBLE_GAIN_FRAC:
+            anomalies.append(
+                f"{task}: candidate exceeds baseline by {c - b:.4g} ({gain:.1%} of scale, implausible)"
+            )
+        per_task[task] = {
+            "baseline": b,
+            "candidate": c,
+            "drop": round(delta, 4),
+            "drop_fraction": round(drop, 4),
+            "scale": scale,
+            "within_threshold": within,
+        }
+        if not within:
+            regressed.append(task)
+
+    if anomalies:
+        return {
+            "pass": False,
+            "decision": ANOMALOUS,
+            "failure_class": "UNKNOWN",
+            "detail": "; ".join(anomalies),
+            "per_task": per_task,
+        }
+    if regressed:
+        return {
+            "pass": False,
+            "decision": REGRESSION,
+            "failure_class": None,
+            "detail": f"tasks exceeding threshold ({threshold}): {regressed}",
+            "per_task": per_task,
+        }
+    return {
+        "pass": True,
+        "decision": ACCEPT,
+        "failure_class": None,
+        "detail": f"all {len(per_task)} task(s) within threshold {threshold}",
+        "per_task": per_task,
+    }
+
+
+def main(argv=None):
+    """CLI entry point: read baseline/candidate score JSON and print the verdict."""
+    p = argparse.ArgumentParser(description="Day-0 compare gate")
+    p.add_argument("--baseline", required=True, help="baseline score JSON {task: score}")
+    p.add_argument("--candidate", required=True, help="candidate score JSON {task: score}")
+    p.add_argument("--threshold", type=float, default=0.01, help="max drop fraction (default 0.01)")
+    p.add_argument("--relative", action="store_true", help="measure drop relative to baseline")
+    p.add_argument(
+        "--scales",
+        help="optional JSON {task: max_scale} to override per-task scale inference",
+    )
+    args = p.parse_args(argv)
+
+    try:
+        with open(args.baseline) as f:
+            baseline = json.load(f)
+        with open(args.candidate) as f:
+            candidate = json.load(f)
+        scales = json.loads(args.scales) if args.scales else None
+    except (OSError, ValueError) as e:  # covers JSONDecodeError, UnicodeDecodeError
+        print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)}))
+        return 2
+
+    result = evaluate_comparison(baseline, candidate, args.threshold, args.relative, scales)
+    print(json.dumps(result, indent=2))
+    return 0 if result["pass"] else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
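
Review note on the scale-aware drop in gate_compare.py: the following is a standalone restatement of the arithmetic, so each case can be checked by hand. `normalized_drop` is a hypothetical helper written for this note. It mirrors the `_infer_scale` rule and the `drop` computation in the diff, but it is not the gate itself, and the scores are made up.

```python
def normalized_drop(baseline, candidate, scale=None, relative=False):
    """Restate the gate's per-task drop arithmetic (illustrative only)."""
    if scale is None:
        # Same inference rule as the diff: 0-1 only if both scores fit in [0, 1].
        scale = 1.0 if all(0.0 <= v <= 1.0 for v in (baseline, candidate)) else 100.0
    delta = baseline - candidate
    if relative:
        # Fraction of the baseline score; a zero baseline yields no drop.
        return delta / baseline if baseline else 0.0
    return delta / scale  # fraction of the task's scale


THRESHOLD = 0.01

# Made-up scores, chosen only to exercise each branch.
cases = {
    "0-100 task": normalized_drop(72.4, 71.8),
    "0-1 task": normalized_drop(0.62, 0.60),
    "ambiguous, inferred": normalized_drop(0.9, 0.4),
    "ambiguous, scale=100": normalized_drop(0.9, 0.4, scale=100.0),
    "relative": normalized_drop(72.4, 71.8, relative=True),
}
for name, drop in cases.items():
    print(f"{name}: drop={drop:.4f} within={drop <= THRESHOLD}")
```

The two "ambiguous" rows show why `--scales` exists: a 0-100 task whose scores both fall below 1 is inferred as 0-1, so a 0.5 pt drop is read as half the scale unless the scale is supplied explicitly.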