diff --git a/.agents/skills/day0-release/SKILL.md b/.agents/skills/day0-release/SKILL.md
new file mode 100644
index 00000000000..5fb2e2714db
--- /dev/null
+++ b/.agents/skills/day0-release/SKILL.md
@@ -0,0 +1,177 @@
+---
+name: day0-release
+description: Deterministic end-to-end driver for day-0 quantized-checkpoint releases — chains PTQ → evaluation → comparison with enforced gates between stages (the evaluation stage deploys the checkpoint itself), and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Use when the user asks to "release a model at day-0", "quantize and validate model X is within N% of baseline and tell me if it's publishable", or "run the full day-0 workflow". Do NOT use for single-stage requests — quantizing only (use ptq), serving only (use deployment), evaluating only (use evaluation), or comparing two existing runs (use compare-results).
+license: Apache-2.0
+---
+
+# Day-0 Release
+
+Drive a model from a pretrained checkpoint to a publish decision for a quantized
+checkpoint, in a fixed sequence with a gate after every stage. This skill is a
+**conductor**: it sequences the existing domain skills and enforces the gates —
+it does not re-implement quantization, serving, evaluation, or comparison.
+
+**Goal (the default day-0 criterion):** a quantized checkpoint smaller than the
+source, with a per-task accuracy drop within the threshold (default: at most 1% of each task's score scale) on the standard
+benchmark set versus the matching baseline, plus a publish recommendation.
+
+## When to use
+
+Use only for the full goal-driven release. For a single stage, route to the
+domain skill directly: quantize → **ptq**, serve → **deployment**, evaluate →
+**evaluation**, compare two existing runs → **compare-results**.
+
+## Inputs
+
+Resolve these before starting (ask the user for anything missing):
+
+- **Model** — HF handle or checkpoint path.
+- **Recipe / qformat** — e.g. `nvfp4`, `fp8`, or a recipe path. One candidate for v1.
+- **Cluster / launcher** — from `clusters.yaml` (see `skills/common/environment-setup.md`).
+- **Eval set** — defaults to the AA suite (`evaluation/recipes/tasks/aa/`).
+- **Threshold** — max accuracy drop; default `0.01` (1%).
+
+## The chain
+
+```text
+setup ─▶ PTQ ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
+          │          │                │               │
+       gate_ptq   gate_run         gate_run       gate_compare
+```
+
+The **evaluation** skill deploys the model it evaluates (it stands up its own
+endpoint per run), so there is no separate deploy stage — a serving failure
+surfaces through the eval stage's gate (`DEPLOYMENT_HEALTH_FAILED`) and triages
+to the **deployment** skill to debug serving in isolation (see Step 4).
+
+Run each stage by invoking the domain skill, then run its gate before
+proceeding. **Do not advance past a failed gate.** Copy this checklist and track
+progress:
+
+```text
+- [ ] Step 0: Resolve inputs; confirm threshold and eval set
+- [ ] Step 1: Setup gate — creds present, cluster reachable
+- [ ] Step 2: PTQ (ptq skill) → gate_ptq.py
+- [ ] Step 3: Baseline eval (evaluation skill, deploys source) → gate_run.py   [skip if cached, see below]
+- [ ] Step 4: Quantized eval (evaluation skill, deploys candidate) → gate_run.py
+- [ ] Step 5: Compare (compare-results skill) → gate_compare.py → decision
+- [ ] Step 6: Closeout — report + publish recommendation
+```
+
+### Step 1 — Setup gate
+
+Confirm credentials (`skills/common/credentials.md`) and cluster reachability
+(`skills/common/remote-execution.md`). If either fails, stop with
+`SYSTEMIC` — do not start PTQ.
+
+### Step 2 — PTQ
+
+Invoke the **ptq** skill to produce the quantized checkpoint. Then gate:
+
+```bash
+# The ptq skill's post-PTQ validation produces a validation-summary JSON (size
+# ratio + layer-precision counts + metadata diffs; see
+# ptq/references/checkpoint-validation.md). v1 gates on that summary:
+python .agents/skills/day0-release/scripts/gate_ptq.py --summary <validation-summary.json>
+#   add `--recipe <qformat>` to override the recipe recorded in the summary
+```
+
+`gate_ptq.py` prints JSON `{pass, failure_class, detail, checks}` and exits 0 on pass,
+1 on a gate failure, or 2 on a missing or unreadable summary. On `pass: false`, branch
+on `failure_class` (see **Triage** below). Do not evaluate an unvalidated checkpoint.
+
+### Step 3 — Baseline eval
+
+The baseline is the **source** (pre-quantization) model on the same task set and
+sampling params. **Look it up first** — if a matching baseline run already
+exists in MLflow (same model, task set, sampling params), reuse it and skip this
+stage. Otherwise run it via the **evaluation** skill (which deploys the source
+model itself). Gate with `gate_run.py`.
+
+### Step 4 — Quantized eval
+
+Invoke the **evaluation** skill on the quantized checkpoint, matching the
+baseline's task set and sampling params. The evaluation skill stands up the
+serving endpoint itself (it builds the `deployment.command`, e.g. a
+`vllm serve …`), so a serving failure surfaces here as a failed `gate_run.py`
+with `DEPLOYMENT_HEALTH_FAILED`. When that happens, **drop to the deployment
+skill** to reproduce and debug serving in isolation (serve the checkpoint
+standalone, confirm `/health` + one generation, iterate on flags / TP / image /
+env vars) rather than burning full eval cycles on a broken endpoint — then carry
+the working command back into NEL's `deployment.command` and resume the eval. If
+the checkpoint genuinely can't serve, stop with `POINT_INFEASIBLE`. Gate:
+
+```bash
+python .agents/skills/day0-release/scripts/gate_run.py --run <run-summary.json>
+```
+
+A `pass: false` here means the run is incomplete or invalid (judge/parse error,
+dropped samples) — do **not** compare scores from it.
+
+### Step 5 — Compare
+
+Invoke the **compare-results** skill to produce per-task deltas, then gate:
+
+```bash
+python .agents/skills/day0-release/scripts/gate_compare.py \
+    --baseline <baseline_scores.json> --candidate <candidate_scores.json> \
+    --threshold 0.01
+```
+
+The threshold is a fraction of each task's score scale. Most AA tasks report
+0-100, but some (e.g. `tau2_bench_telecom` `Result`) report 0-1; the gate infers
+each task's scale (0-1 if both scores are within [0, 1], else 0-100) and
+normalizes the drop accordingly, so `--threshold 0.01` means "≤1 pt on a 0-100
+task / ≤0.01 on a 0-1 task" uniformly. Pass `--scales '{"task": max}'` to
+override inference if a task's scores happen to fall in an ambiguous range.
+
+Decision from `gate_compare.py`:
+
+- **ACCEPT** — every task within threshold → go to Step 6.
+- **REGRESSION** — one or more tasks exceed threshold. **v1 stops here and
+  reports** which tasks regressed by how much. (Picking the next recipe and
+  re-running is deferred — see Scope.)
+- **ANOMALOUS** — scores present but implausible (e.g. baseline lower than
+  candidate by more than 5% of the task's scale, or a non-finite or out-of-range
+  score), or the baseline and candidate task sets differ → surface to the user.
+
+### Step 6 — Closeout
+
+Report the decision with: source vs output size + ratio, per-task baseline /
+candidate / delta / within-threshold, MLflow run IDs, and a publish
+recommendation (publish / do-not-publish / needs-human). Archive artifacts to
+the workspace.
+
+## Triage (gate failure → decision)
+
+Map a gate's `failure_class` to the next action:
+
+| `failure_class` | Action |
+| --- | --- |
+| `INFRA_TRANSIENT` | Retry the stage once; if it recurs, `SYSTEMIC`. |
+| `MODEL_UNSUPPORTED` | PATCH: fix the recipe pattern / add model support (ptq skill owns the patch loop), then retry. If unpatchable, `POINT_INFEASIBLE`. |
+| `QUANT_COVERAGE_FAILURE` | PATCH: fix the recipe wildcard so intended layers are covered; re-run PTQ. |
+| `DEPLOYMENT_HEALTH_FAILED` | Drop to the **deployment** skill: reproduce serving standalone (`/health` + one generation), debug flags / image / TP / env, then carry the working command into NEL's `deployment.command` and retry the eval. If it can't serve, `POINT_INFEASIBLE`. |
+| `EVAL_JUDGE_FAILED` | Usually transient (auth / rate limit) — wait and retry. |
+| `SAMPLE_ACCOUNTING_FAILED` | Investigate dropped/failed samples before trusting scores. |
+| `USER_CONFIG_ERROR` | Stop and ask the user. |
+| `UNKNOWN` | Stop and surface to the user (`NEEDS_HUMAN`). |
+
+`SYSTEMIC` (cluster down, dataset unavailable) aborts the whole run.
+`POINT_INFEASIBLE` means this (model, recipe) can't work as configured.
+
+## Output
+
+Return a decision, not a raw artifact:
+
+- `ACCEPT` + report + publish recommendation
+- `REGRESSION` + which tasks failed the threshold and by how much
+- `ANOMALOUS` / `INFEASIBLE` / `NEEDS_HUMAN` + reason
+- Always: workspace path + MLflow run IDs for traceability
+
+## Scope (v1)
+
+In v1: the linear chain + gates + report. On `REGRESSION`, v1 reports and stops.
+Deferred to a follow-up: the evaluator-optimizer recipe loop (compare → pick the
+next recipe → re-run PTQ), which needs the bigpareto integration and a shared
+config/result schema.
diff --git a/.agents/skills/day0-release/scripts/gate_compare.py b/.agents/skills/day0-release/scripts/gate_compare.py
new file mode 100644
index 00000000000..d8ae195acd1
--- /dev/null
+++ b/.agents/skills/day0-release/scripts/gate_compare.py
@@ -0,0 +1,208 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Day-0 compare gate.
+
+Decides whether a quantized candidate is within the accuracy threshold of its
+baseline, per task. Pure decision logic in ``evaluate_comparison`` (unit-tested
+without GPU/cluster); ``main`` reads score JSON files and prints the verdict.
+
+Score files are ``{task_name: score}`` dicts. Most AA task references report
+``*_avg_of_N`` on a 0-100 scale, but some tasks (e.g. ``tau2_bench_telecom``
+``Result``) report on a 0-1 scale. The gate is therefore scale-aware: each
+task's scale is inferred per task (0-1 if both scores are within [0, 1], else
+0-100) or supplied explicitly via ``--scales``, and the drop is normalized to a
+fraction of that scale so the threshold applies uniformly. The drop is an
+absolute (scale-normalized) delta unless ``--relative`` is passed.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import sys
+
+
+# Decisions
+ACCEPT = "ACCEPT"
+REGRESSION = "REGRESSION"
+ANOMALOUS = "ANOMALOUS"
+
+# Plausibility bounds. Scores may be on a 0-1 or 0-100 scale (see _infer_scale);
+# the upper bound is the larger of the two so both are accepted.
+_SCORE_MIN = 0.0
+_SCORE_MAX = 100.0
+# A candidate scoring this fraction of its scale ABOVE baseline is implausible
+# for quantization (quantization should not meaningfully improve accuracy); flag
+# it rather than silently passing. 0.05 = 5 pts on a 0-100 task, 0.05 on a 0-1 task.
+_IMPLAUSIBLE_GAIN_FRAC = 0.05
+
+
+def _is_valid_score(val):
+    """True only for a finite real number in [_SCORE_MIN, _SCORE_MAX] (not bool)."""
+    return (
+        isinstance(val, (int, float))
+        and not isinstance(val, bool)
+        and math.isfinite(val)
+        and _SCORE_MIN <= val <= _SCORE_MAX
+    )
+
+
+def _infer_scale(*vals):
+    """Infer a task's score scale: 1.0 if every score is within [0, 1], else 100.0.
+
+    Most AA tasks report 0-100; a few (e.g. ``tau2_bench_telecom``) report 0-1.
+    Without scale metadata in the score files, we treat a task as 0-1 only when
+    every score for it fits in [0, 1] — a 0-100 task with sub-1.0 accuracy is
+    degenerate and caught elsewhere. Pass an explicit scale to override.
+    """
+    return 1.0 if all(0.0 <= v <= 1.0 for v in vals) else 100.0
+
+
+def evaluate_comparison(baseline, candidate, threshold=0.01, relative=False, scales=None):
+    """Compare candidate vs baseline scores per task.
+
+    Args:
+        baseline: dict ``{task: score}``.
+        candidate: dict ``{task: score}``.
+        threshold: max allowed drop, as a fraction of the task's scale
+            (0.01 = 1 percentage point on a 0-100 task / 0.01 on a 0-1 task,
+            or 1% relative if ``relative``).
+        relative: if True, drop is measured relative to the baseline score
+            (scale-invariant).
+        scales: optional dict ``{task: max_scale}`` to override per-task scale
+            inference (e.g. ``{"tau2_bench_telecom": 1.0}``).
+
+    Returns:
+        dict ``{pass, decision, failure_class, detail, per_task}``.
+    """
+    scales = scales or {}
+    missing = sorted(set(baseline) ^ set(candidate))  # tasks present on only one side
+    if missing:
+        return {
+            "pass": False,
+            "decision": ANOMALOUS,
+            "failure_class": "SAMPLE_ACCOUNTING_FAILED",
+            "detail": f"task sets differ; missing on one side: {missing}",
+            "per_task": {},
+        }
+    if not baseline:
+        return {
+            "pass": False,
+            "decision": ANOMALOUS,
+            "failure_class": "USER_CONFIG_ERROR",
+            "detail": "no tasks to compare",
+            "per_task": {},
+        }
+
+    per_task = {}
+    regressed = []
+    anomalies = []
+    for task in sorted(baseline):
+        b, c = baseline[task], candidate[task]
+        invalid = False
+        for label, val in (("baseline", b), ("candidate", c)):
+            if not _is_valid_score(val):
+                anomalies.append(f"{task}: {label} score {val!r} not a finite number in [0, 100]")
+                invalid = True
+        if invalid:
+            # Don't compute deltas on non-numeric/out-of-range scores (would raise
+            # TypeError); record the anomaly and move on — the run is ANOMALOUS.
+            per_task[task] = {
+                "baseline": b,
+                "candidate": c,
+                "drop": None,
+                "within_threshold": False,
+            }
+            continue
+        scale = scales.get(task) or _infer_scale(b, c)
+        delta = b - c  # native units, for reporting
+        if relative:
+            drop = delta / b if b else 0.0  # fraction of baseline (scale-invariant)
+        else:
+            drop = delta / scale  # fraction of the task's scale
+        within = drop <= threshold
+        gain = (c - b) / scale
+        if gain > _IMPLAUSIBLE_GAIN_FRAC:
+            anomalies.append(
+                f"{task}: candidate exceeds baseline by {c - b:.4g} ({gain:.1%} of scale, implausible)"
+            )
+        per_task[task] = {
+            "baseline": b,
+            "candidate": c,
+            "drop": round(delta, 4),
+            "drop_fraction": round(drop, 4),
+            "scale": scale,
+            "within_threshold": within,
+        }
+        if not within:
+            regressed.append(task)
+
+    if anomalies:
+        return {
+            "pass": False,
+            "decision": ANOMALOUS,
+            "failure_class": "UNKNOWN",
+            "detail": "; ".join(anomalies),
+            "per_task": per_task,
+        }
+    if regressed:
+        return {
+            "pass": False,
+            "decision": REGRESSION,
+            "failure_class": None,
+            "detail": f"tasks exceeding threshold ({threshold}): {regressed}",
+            "per_task": per_task,
+        }
+    return {
+        "pass": True,
+        "decision": ACCEPT,
+        "failure_class": None,
+        "detail": f"all {len(per_task)} task(s) within threshold {threshold}",
+        "per_task": per_task,
+    }
+
+
+def main(argv=None):
+    """CLI entry point: read baseline/candidate score JSON and print the verdict."""
+    p = argparse.ArgumentParser(description="Day-0 compare gate")
+    p.add_argument("--baseline", required=True, help="baseline score JSON {task: score}")
+    p.add_argument("--candidate", required=True, help="candidate score JSON {task: score}")
+    p.add_argument("--threshold", type=float, default=0.01, help="max drop fraction (default 0.01)")
+    p.add_argument("--relative", action="store_true", help="measure drop relative to baseline")
+    p.add_argument(
+        "--scales",
+        help="optional JSON {task: max_scale} to override per-task scale inference",
+    )
+    args = p.parse_args(argv)
+
+    try:
+        with open(args.baseline) as f:
+            baseline = json.load(f)
+        with open(args.candidate) as f:
+            candidate = json.load(f)
+        scales = json.loads(args.scales) if args.scales else None
+    except (OSError, json.JSONDecodeError) as e:
+        print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)}))
+        return 2
+
+    result = evaluate_comparison(baseline, candidate, args.threshold, args.relative, scales)
+    print(json.dumps(result, indent=2))
+    return 0 if result["pass"] else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/.agents/skills/day0-release/scripts/gate_ptq.py b/.agents/skills/day0-release/scripts/gate_ptq.py
new file mode 100644
index 00000000000..3425e775fa2
--- /dev/null
+++ b/.agents/skills/day0-release/scripts/gate_ptq.py
@@ -0,0 +1,192 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Day-0 post-quantization checkpoint gate.
+
+Mirrors the required checks in ptq/references/checkpoint-validation.md:
+  1. Output smaller than source (size ratio < 1 for a compression recipe).
+  2. Quantized-weight coverage matches the requested recipe (no intended layer
+     group left unquantized).
+  3. No unexpected metadata diffs vs the source.
+
+Pure decision logic in ``evaluate_checkpoint`` (unit-tested without real
+checkpoints); ``main`` reads a validation-summary JSON produced from the
+exported checkpoint (e.g. from hf_ptq.py's quant summary + a size scan) and
+prints the verdict.
+
+Validation summary shape:
+    {
+      "source_bytes": int,
+      "output_bytes": int,
+      "recipe": "nvfp4" | "fp8" | "nvfp4_mlp_only" | ...,
+      "layer_precision_counts": {
+          "NVFP4": int, "FP8": int, "INT4": int,
+          "BF16_or_excluded": int,
+          "unexpected_unquantized": int,
+          "declaration_mismatch": int
+      },
+      "metadata_diffs": [str, ...]   # unexpected diffs only; [] if clean
+    }
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+
+# Which precision bucket each recipe is expected to populate with a nonzero count.
+_RECIPE_EXPECTED_PRECISION = {
+    "nvfp4": "NVFP4",
+    "nvfp4_mlp_only": "NVFP4",
+    "nvfp4_experts_only": "NVFP4",
+    "nvfp4_omlp_only": "NVFP4",
+    "fp8": "FP8",
+    "int4_awq": "INT4",
+}
+
+
+def evaluate_checkpoint(summary):
+    """Validate an exported quantized checkpoint summary.
+
+    Returns dict ``{pass, failure_class, detail, checks}``.
+    """
+    if not summary:
+        return {
+            "pass": False,
+            "failure_class": "USER_CONFIG_ERROR",
+            "detail": "empty validation summary",
+            "checks": {},
+        }
+
+    src = summary.get("source_bytes")
+    out = summary.get("output_bytes")
+    recipe = (summary.get("recipe") or "").lower()
+    counts = summary.get("layer_precision_counts") or {}
+    metadata_diffs = summary.get("metadata_diffs") or []
+
+    checks = {}
+    failures = []  # (failure_class, detail)
+
+    # Check 1 — size.
+    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) and v > 0 for v in (src, out)):
+        checks["size"] = "missing/invalid source or output bytes"
+        failures.append(("USER_CONFIG_ERROR", "missing source/output sizes"))
+    else:
+        ratio = out / src
+        checks["size"] = f"{out}/{src} = {ratio:.3f}x"
+        if ratio >= 1.0:
+            failures.append(
+                ("QUANT_COVERAGE_FAILURE", f"output not smaller than source (ratio {ratio:.3f})")
+            )
+
+    # Check 2 — coverage.
+    expected_bucket = _RECIPE_EXPECTED_PRECISION.get(recipe)
+    if expected_bucket is None:
+        checks["coverage"] = f"unknown recipe {recipe!r}; cannot verify coverage"
+        failures.append(("USER_CONFIG_ERROR", f"unknown recipe {recipe!r}"))
+    else:
+        covered = counts.get(expected_bucket, 0)
+        unexpected = counts.get("unexpected_unquantized", 0)
+        mismatch = counts.get("declaration_mismatch", 0)
+        checks["coverage"] = (
+            f"{expected_bucket}={covered}, "
+            f"unexpected_unquantized={unexpected}, "
+            f"declaration_mismatch={mismatch}"
+        )
+        if covered == 0:
+            failures.append(
+                (
+                    "MODEL_UNSUPPORTED",
+                    f"recipe {recipe} targets {expected_bucket} but 0 layers covered "
+                    "(wildcard likely missed the module names)",
+                )
+            )
+        if unexpected > 0:
+            failures.append(
+                ("QUANT_COVERAGE_FAILURE", f"{unexpected} layer(s) unexpectedly unquantized")
+            )
+        if mismatch > 0:
+            failures.append(
+                (
+                    "QUANT_COVERAGE_FAILURE",
+                    f"{mismatch} layer(s) with precision/declaration mismatch",
+                )
+            )
+
+    # Check 3 — metadata.
+    checks["metadata"] = "clean" if not metadata_diffs else f"{len(metadata_diffs)} diff(s)"
+    if metadata_diffs:
+        failures.append(("QUANT_COVERAGE_FAILURE", f"unexpected metadata diffs: {metadata_diffs}"))
+
+    if not failures:
+        return {
+            "pass": True,
+            "failure_class": None,
+            "detail": "size, coverage, and metadata all pass",
+            "checks": checks,
+        }
+
+    # Surface the most actionable failure_class first: MODEL_UNSUPPORTED >
+    # QUANT_COVERAGE_FAILURE > USER_CONFIG_ERROR.
+    order = ["MODEL_UNSUPPORTED", "QUANT_COVERAGE_FAILURE", "USER_CONFIG_ERROR"]
+    failures.sort(key=lambda f: order.index(f[0]) if f[0] in order else len(order))
+    return {
+        "pass": False,
+        "failure_class": failures[0][0],
+        "detail": "; ".join(d for _, d in failures),
+        "checks": checks,
+    }
+
+
+def main(argv=None):
+    """CLI entry point: read a validation-summary JSON and print the verdict."""
+    p = argparse.ArgumentParser(description="Day-0 post-quantization checkpoint gate")
+    p.add_argument("--summary", help="validation-summary JSON (see module docstring)")
+    p.add_argument("--checkpoint", help="(reserved) checkpoint dir; v1 expects --summary")
+    p.add_argument("--source", help="(reserved) source model id/path")
+    p.add_argument("--recipe", help="qformat; overrides summary.recipe if given")
+    args = p.parse_args(argv)
+
+    if not args.summary:
+        print(
+            json.dumps(
+                {
+                    "pass": False,
+                    "failure_class": "USER_CONFIG_ERROR",
+                    "detail": "v1 requires --summary <validation-summary.json>; "
+                    "produce it from the exported checkpoint (size scan + hf_ptq quant summary)",
+                }
+            )
+        )
+        return 2
+
+    try:
+        with open(args.summary) as f:
+            summary = json.load(f)
+    except (OSError, json.JSONDecodeError) as e:
+        print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)}))
+        return 2
+
+    if args.recipe:
+        summary["recipe"] = args.recipe
+
+    result = evaluate_checkpoint(summary)
+    print(json.dumps(result, indent=2))
+    return 0 if result["pass"] else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/.agents/skills/day0-release/scripts/gate_run.py b/.agents/skills/day0-release/scripts/gate_run.py
new file mode 100644
index 00000000000..d5dcbe94a70
--- /dev/null
+++ b/.agents/skills/day0-release/scripts/gate_run.py
@@ -0,0 +1,159 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Day-0 evaluation-run gate.
+
+Validates that a completed evaluation run is trustworthy before its scores are
+compared. Mirrors the checks in evaluation/references/run-validation.md. Pure
+decision logic in ``evaluate_run`` (unit-tested without a cluster); ``main``
+reads a run-summary JSON and prints the verdict.
+
+The run summary is a dict with, per task:
+    {
+      "tasks": {
+        "<task>": {
+          "status": "SUCCESS" | "FAILED" | "RUNNING" | "PENDING" | "TIMEOUT" | "RESUMING",
+          "expected_samples": int,
+          "scored_samples": int,
+          "score": float | null,          # canonical score, if extracted
+          "errors": [str, ...]            # judge/parse/sample errors, if any
+        }
+      }
+    }
+Only a terminal SUCCESS with complete, numeric scores passes. Non-terminal
+statuses (RUNNING/PENDING/TIMEOUT/RESUMING) do NOT pass — the run hasn't
+finished — but they classify as INFRA_TRANSIENT (wait for NEL to resume/finish;
+not a real regression), distinct from a terminal FAILED.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import math
+import sys
+
+_TERMINAL_OK = "SUCCESS"
+# Not done yet — NEL resumes/finishes these; transient, not a real failure.
+_NON_TERMINAL = {"TIMEOUT", "RESUMING", "RUNNING", "PENDING"}
+
+
+def evaluate_run(summary):
+    """Validate a completed run summary.
+
+    Returns dict ``{pass, failure_class, detail, per_task}``.
+    """
+    tasks = (summary or {}).get("tasks")
+    if not tasks:
+        return {
+            "pass": False,
+            "failure_class": "USER_CONFIG_ERROR",
+            "detail": "run summary has no tasks",
+            "per_task": {},
+        }
+
+    per_task = {}
+    problems = []
+    for name, t in sorted(tasks.items()):
+        status = t.get("status")
+        expected = t.get("expected_samples")
+        scored = t.get("scored_samples")
+        score = t.get("score")
+        errors = t.get("errors") or []
+
+        ok = True
+        reasons = []
+
+        if status in _NON_TERMINAL:
+            ok = False
+            reasons.append(f"status {status}: not terminal yet (resume/finish expected)")
+        elif status != _TERMINAL_OK:
+            ok = False
+            reasons.append(f"status {status!r} is not SUCCESS")
+
+        if errors:
+            ok = False
+            # Scan all errors for judge/auth keywords; report the first error as the reason.
+            joined = " ".join(errors).lower()
+            if any(k in joined for k in ("judge", "rate limit", "unauthorized", "auth")):
+                reasons.append(f"judge/auth error: {errors[0]}")
+            else:
+                reasons.append(f"error: {errors[0]}")
+
+        if expected is not None and scored is not None and scored != expected:
+            ok = False
+            reasons.append(f"sample accounting: scored {scored} of {expected}")
+
+        if score is None:
+            ok = False
+            reasons.append("no score extracted")
+        elif not (
+            isinstance(score, (int, float)) and not isinstance(score, bool) and math.isfinite(score)
+        ):
+            ok = False
+            reasons.append(f"score not numeric/finite: {score!r}")
+
+        per_task[name] = {"ok": ok, "reasons": reasons}
+        if not ok:
+            problems.append((name, reasons))
+
+    if not problems:
+        return {
+            "pass": True,
+            "failure_class": None,
+            "detail": f"all {len(per_task)} task(s) valid",
+            "per_task": per_task,
+        }
+
+    # Pick the dominant failure_class for the run.
+    flat = " ".join(r for _, rs in problems for r in rs).lower()
+    if any(k in flat for k in ("judge", "rate limit", "unauthorized", "auth")):
+        fc = "EVAL_JUDGE_FAILED"
+    elif "not terminal" in flat:
+        # Non-terminal (RUNNING/PENDING/TIMEOUT/RESUMING): wait for resume/finish.
+        fc = "INFRA_TRANSIENT"
+    elif "sample accounting" in flat or "no score" in flat or "score not numeric" in flat:
+        fc = "SAMPLE_ACCOUNTING_FAILED"
+    else:
+        fc = "UNKNOWN"
+
+    return {
+        "pass": False,
+        "failure_class": fc,
+        "detail": "; ".join(f"{n}: {', '.join(rs)}" for n, rs in problems),
+        "per_task": per_task,
+    }
+
+
+def main(argv=None):
+    """CLI entry point: read a run-summary JSON and print the verdict."""
+    p = argparse.ArgumentParser(description="Day-0 evaluation-run gate")
+    p.add_argument("--run", required=True, help="run-summary JSON (see module docstring)")
+    args = p.parse_args(argv)
+
+    try:
+        with open(args.run) as f:
+            summary = json.load(f)
+    except (OSError, json.JSONDecodeError) as e:
+        print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)}))
+        return 2
+
+    result = evaluate_run(summary)
+    print(json.dumps(result, indent=2))
+    return 0 if result["pass"] else 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/.agents/skills/day0-release/tests/evals.json b/.agents/skills/day0-release/tests/evals.json
new file mode 100644
index 00000000000..bd78b22ce11
--- /dev/null
+++ b/.agents/skills/day0-release/tests/evals.json
@@ -0,0 +1,71 @@
+[
+  {
+    "name": "full-day0-release-triggers",
+    "skills": [
+      "day0-release"
+    ],
+    "query": "Release org/PLACEHOLDER-MODEL at day-0: quantize to NVFP4, validate it's within 1% of the BF16 baseline on the AA suite, and tell me if it's publishable. Run on my cluster.",
+    "files": [],
+    "expected_behavior": [
+      "Selects the day0-release skill for the full goal-driven release",
+      "Resolves model, recipe/qformat, cluster, eval set, and threshold before starting",
+      "Runs stages in fixed order: setup, PTQ, baseline eval, quantized eval, compare, closeout",
+      "Runs the gate after each stage and does not advance past a failed gate",
+      "Returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE), not just raw scores"
+    ]
+  },
+  {
+    "name": "quantize-only-does-not-trigger-day0",
+    "skills": [
+      "ptq"
+    ],
+    "query": "Quantize org/PLACEHOLDER-MODEL to NVFP4 and save the checkpoint.",
+    "files": [],
+    "expected_behavior": [
+      "Selects the ptq skill, not day0-release",
+      "Does not start deployment, evaluation, or comparison",
+      "Treats this as a single-stage quantization request"
+    ]
+  },
+  {
+    "name": "evaluate-only-does-not-trigger-day0",
+    "skills": [
+      "evaluation"
+    ],
+    "query": "Run MMLU-Pro on the endpoint I already have serving at http://host:8000.",
+    "files": [],
+    "expected_behavior": [
+      "Selects the evaluation skill, not day0-release",
+      "Does not quantize or run a baseline-vs-candidate comparison"
+    ]
+  },
+  {
+    "name": "ptq-gate-blocks-on-coverage-miss",
+    "skills": [
+      "day0-release"
+    ],
+    "query": "Run the day-0 workflow for org/PLACEHOLDER-MODEL with nvfp4, but the PTQ checkpoint came out the same size as the source.",
+    "files": [],
+    "expected_behavior": [
+      "Runs gate_ptq.py on the exported checkpoint before evaluating",
+      "Treats a size ratio >= 1.0 or zero quantized-layer coverage as a gate failure",
+      "Does not evaluate a checkpoint that failed the PTQ gate",
+      "Branches on the failure_class (e.g. MODEL_UNSUPPORTED or QUANT_COVERAGE_FAILURE) rather than silently continuing"
+    ]
+  },
+  {
+    "name": "regression-reports-and-stops-in-v1",
+    "skills": [
+      "day0-release",
+      "compare-results"
+    ],
+    "query": "Day-0 release for org/PLACEHOLDER-MODEL nvfp4; the quantized GPQA score is 3 points below baseline.",
+    "files": [],
+    "expected_behavior": [
+      "Runs gate_compare.py with the accuracy threshold",
+      "Classifies a beyond-threshold drop as REGRESSION",
+      "Reports which tasks regressed and by how much",
+      "Does NOT auto-select a new recipe and re-run in v1 (that loop is deferred)"
+    ]
+  }
+]
diff --git a/.agents/skills/day0-release/tests/test_gates.py b/.agents/skills/day0-release/tests/test_gates.py
new file mode 100644
index 00000000000..e4b39c7b7eb
--- /dev/null
+++ b/.agents/skills/day0-release/tests/test_gates.py
@@ -0,0 +1,235 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Unit tests for the day-0 gate scripts.
+
+These are deterministic — no GPU, cluster, or network. They test the pure
+decision functions that the gates rest on. Run with:
+
+    python -m pytest .agents/skills/day0-release/tests/test_gates.py
+"""
+
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).parent.parent / "scripts"))
+
+from gate_compare import evaluate_comparison
+from gate_ptq import evaluate_checkpoint
+from gate_run import evaluate_run
+
+# ── gate_compare ──────────────────────────────────────────────────────
+
+
+def test_compare_accept_within_threshold():
+    r = evaluate_comparison(
+        {"gpqa": 50.0, "scicode": 30.0}, {"gpqa": 49.5, "scicode": 29.8}, threshold=0.01
+    )
+    assert r["pass"] and r["decision"] == "ACCEPT"
+
+
+def test_compare_regression_exceeds_threshold():
+    r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": 47.5}, threshold=0.01)  # 2.5 pt drop
+    assert not r["pass"] and r["decision"] == "REGRESSION"
+    assert "gpqa" in r["detail"]
+
+
+def test_compare_anomalous_implausible_gain():
+    r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": 60.0}, threshold=0.01)  # +10 pts
+    assert not r["pass"] and r["decision"] == "ANOMALOUS"
+
+
+def test_compare_anomalous_out_of_range():
+    r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": 150.0}, threshold=0.01)
+    assert not r["pass"] and r["decision"] == "ANOMALOUS"
+
+
+def test_compare_mismatched_task_sets():
+    r = evaluate_comparison({"gpqa": 50.0}, {"scicode": 30.0}, threshold=0.01)
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_compare_relative_threshold():
+    # 1% relative of 50 = 0.5 pts; a 0.4 pt drop passes, 0.6 fails.
+    assert evaluate_comparison({"t": 50.0}, {"t": 49.6}, threshold=0.01, relative=True)["pass"]
+    assert not evaluate_comparison({"t": 50.0}, {"t": 49.4}, threshold=0.01, relative=True)["pass"]
+
+
+def test_compare_0_to_1_scale_full_collapse_is_regression():
+    # tau2_bench_telecom reports Result on a 0-1 scale. A full collapse
+    # (1.0 -> 0.0) must be a REGRESSION, not a pass under the old 0-100 scale assumption.
+    r = evaluate_comparison(
+        {"tau2_bench_telecom": 1.0}, {"tau2_bench_telecom": 0.0}, threshold=0.01
+    )
+    assert not r["pass"] and r["decision"] == "REGRESSION"
+    assert "tau2_bench_telecom" in r["detail"]
+
+
+def test_compare_0_to_1_scale_within_threshold_accepts():
+    # A 0.005 drop on a 0-1 task is within the 0.01 threshold.
+    r = evaluate_comparison({"t": 0.900}, {"t": 0.895}, threshold=0.01)
+    assert r["pass"] and r["decision"] == "ACCEPT"
+
+
+def test_compare_explicit_scale_override():
+    # Force a 0-100 scale even though both scores fit in [0, 1]: a 0.5 -> 0.4
+    # drop is 0.1 pts on a 0-100 scale, well within threshold.
+    r = evaluate_comparison({"t": 0.5}, {"t": 0.4}, threshold=0.01, scales={"t": 100.0})
+    assert r["pass"] and r["decision"] == "ACCEPT"
+
+
+def test_compare_mixed_scales_in_one_suite():
+    # 0-100 task within threshold + 0-1 task collapsing -> overall REGRESSION.
+    r = evaluate_comparison(
+        {"gpqa": 50.0, "tau2_bench_telecom": 1.0},
+        {"gpqa": 49.8, "tau2_bench_telecom": 0.0},
+        threshold=0.01,
+    )
+    assert not r["pass"] and r["decision"] == "REGRESSION"
+    assert "tau2_bench_telecom" in r["detail"] and "gpqa" not in r["detail"]
+
+
+# ── gate_run ──────────────────────────────────────────────────────────
+
+
+def _task(**kw):
+    base = {
+        "status": "SUCCESS",
+        "expected_samples": 100,
+        "scored_samples": 100,
+        "score": 42.0,
+        "errors": [],
+    }
+    base.update(kw)
+    return base
+
+
+def test_run_all_valid():
+    r = evaluate_run({"tasks": {"gpqa": _task(), "scicode": _task()}})
+    assert r["pass"]
+
+
+def test_run_dropped_samples():
+    r = evaluate_run({"tasks": {"gpqa": _task(scored_samples=90)}})
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_run_judge_error():
+    r = evaluate_run({"tasks": {"gpqa": _task(errors=["judge rate limit exceeded"])}})
+    assert not r["pass"] and r["failure_class"] == "EVAL_JUDGE_FAILED"
+
+
+def test_run_missing_score():
+    r = evaluate_run({"tasks": {"gpqa": _task(score=None)}})
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_run_timeout_is_not_terminal():
+    r = evaluate_run({"tasks": {"gpqa": _task(status="TIMEOUT")}})
+    assert not r["pass"] and r["failure_class"] == "INFRA_TRANSIENT"
+
+
+def test_run_no_tasks():
+    r = evaluate_run({"tasks": {}})
+    assert not r["pass"] and r["failure_class"] == "USER_CONFIG_ERROR"
+
+
+# ── gate_ptq ──────────────────────────────────────────────────────────
+
+
+def _ckpt(**kw):
+    base = {
+        "source_bytes": 16_000_000_000,
+        "output_bytes": 8_000_000_000,
+        "recipe": "nvfp4",
+        "layer_precision_counts": {
+            "NVFP4": 224,
+            "BF16_or_excluded": 3,
+            "unexpected_unquantized": 0,
+            "declaration_mismatch": 0,
+        },
+        "metadata_diffs": [],
+    }
+    base.update(kw)
+    return base
+
+
+def test_ptq_pass():
+    assert evaluate_checkpoint(_ckpt())["pass"]
+
+
+def test_ptq_not_smaller():
+    r = evaluate_checkpoint(_ckpt(output_bytes=16_000_000_000))
+    assert not r["pass"] and r["failure_class"] == "QUANT_COVERAGE_FAILURE"
+
+
+def test_ptq_zero_coverage_is_model_unsupported():
+    r = evaluate_checkpoint(
+        _ckpt(
+            layer_precision_counts={
+                "NVFP4": 0,
+                "unexpected_unquantized": 0,
+                "declaration_mismatch": 0,
+            }
+        )
+    )
+    assert not r["pass"] and r["failure_class"] == "MODEL_UNSUPPORTED"
+
+
+def test_ptq_unexpected_unquantized():
+    r = evaluate_checkpoint(
+        _ckpt(
+            layer_precision_counts={
+                "NVFP4": 200,
+                "unexpected_unquantized": 24,
+                "declaration_mismatch": 0,
+            }
+        )
+    )
+    assert not r["pass"] and r["failure_class"] == "QUANT_COVERAGE_FAILURE"
+
+
+def test_ptq_metadata_diff():
+    r = evaluate_checkpoint(_ckpt(metadata_diffs=["chat_template changed"]))
+    assert not r["pass"] and r["failure_class"] == "QUANT_COVERAGE_FAILURE"
+
+
+def test_ptq_unknown_recipe():
+    r = evaluate_checkpoint(_ckpt(recipe="mystery"))
+    assert not r["pass"] and r["failure_class"] == "USER_CONFIG_ERROR"
+
+
+# ── regression tests for malformed inputs ────────────────────────────
+
+
+def test_compare_non_numeric_score_is_anomalous_not_crash():
+    # A non-numeric score (str, None, NaN, bool) must not raise TypeError; it's ANOMALOUS.
+    for bad in ("42", None, float("nan"), True):
+        r = evaluate_comparison({"gpqa": 50.0}, {"gpqa": bad}, threshold=0.01)
+        assert not r["pass"] and r["decision"] == "ANOMALOUS", bad
+
+
+def test_run_non_numeric_score_fails():
+    r = evaluate_run({"tasks": {"gpqa": _task(score="42")}})
+    assert not r["pass"] and r["failure_class"] == "SAMPLE_ACCOUNTING_FAILED"
+
+
+def test_run_running_is_infra_transient():
+    r = evaluate_run({"tasks": {"gpqa": _task(status="RUNNING", score=None)}})
+    assert not r["pass"] and r["failure_class"] == "INFRA_TRANSIENT"
+
+
+if __name__ == "__main__":
+    sys.exit(__import__("pytest").main([__file__, "-q"]))
diff --git a/.github/workflows/unit_tests.yml b/.github/workflows/unit_tests.yml
index fc2ae364cba..66f2486bcec 100644
--- a/.github/workflows/unit_tests.yml
+++ b/.github/workflows/unit_tests.yml
@@ -13,6 +13,7 @@ on:
       - "pyproject.toml"
       - "tests/unit/**"
       - "tools/launcher/**"
+      - ".agents/skills/**"
   schedule:
     - cron: "0 0 * * *" # Nightly
   workflow_dispatch:
@@ -55,6 +56,7 @@ jobs:
             pyproject.toml
             tests/unit/**
             tools/launcher/**
+            .agents/skills/**
   linux:
     needs: [check-dco]
     runs-on: ubuntu-latest
@@ -142,10 +144,27 @@ jobs:
           uv venv .venv
           uv pip install -e . pytest
           uv run python3 -m pytest -v
+  skills:
+    if: needs.check-file-changes.outputs.any_changed == 'true'
+    needs: [linux, check-file-changes]
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    steps:
+      - uses: actions/checkout@v6
+      - uses: actions/setup-python@v6
+        with:
+          python-version: "3.12"
+      - name: Run skill gate tests
+        # Skill gate tests are stdlib-only and hermetic (no GPU/cluster/network),
+        # so they run in their own lightweight job rather than the main unit lane.
+          # Clear addopts: the repo defaults pass coverage/instafail plugin options, and those plugins are not installed here.
+        run: |
+          pip install pytest
+          python -m pytest .agents/skills/ -o addopts="" -p no:cacheprovider -v
   unit-pr-required-check:
     # Run even if some jobs are skipped
     if: ${{ github.event_name == 'pull_request' && always() }}
-    needs: [check-file-changes, linux, windows, multi-version, partial-install, launcher]
+    needs: [check-file-changes, linux, windows, multi-version, partial-install, launcher, skills]
     runs-on: ubuntu-latest
     steps:
       - name: Required unit tests did not succeed
@@ -154,6 +173,7 @@ jobs:
             needs.windows.result != 'success' ||
             needs.multi-version.result != 'success' ||
             needs.partial-install.result != 'success' ||
-            needs.launcher.result != 'success'
+            needs.launcher.result != 'success' ||
+            needs.skills.result != 'success'
           )) }}
         run: exit 1
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index da02b315f67..5ec17822d64 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -80,6 +80,7 @@ Changelog
 - Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#load-qatptq-model-and-serve-in-vllm-wip>`_ for more details.
 - [Early Testing] Add Claude Code PTQ skill (``.claude/skills/ptq/``) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
 - [Early Testing] Polish Claude Code evaluation skill (``.claude/skills/evaluation/``) for agent-assisted LLM accuracy benchmarking via NeMo Evaluator Launcher. Adds two companion skills vendored verbatim from `NVIDIA-NeMo/Evaluator <https://github.com/NVIDIA-NeMo/Evaluator>`_: ``launching-evals`` (run/check/debug/analyze NEL evaluations) and ``accessing-mlflow`` (query MLflow runs, compare metrics, fetch artifacts). Re-sync at a pinned upstream SHA via ``.claude/scripts/sync-upstream-skills.sh``. Also adds a shared ``skills/common/credentials.md`` covering HF / NGC / Docker token setup referenced by multiple skills. This feature is in early testing — use with caution.
+- [Early Testing] Add Claude Code day0-release skill (``.agents/skills/day0-release/``), a deterministic end-to-end driver that chains the PTQ → evaluation → comparison skills (the evaluation stage deploys the checkpoint itself) with an enforced gate after each stage and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Ships three GPU-free, unit-tested gate scripts (``gate_ptq.py``, ``gate_run.py``, ``gate_compare.py``) that validate checkpoint coverage, evaluation-run completeness, and the baseline-vs-candidate accuracy threshold. In v1, the skill reports and stops on a regression; the recipe-search loop is deferred. This feature is in early testing — use with caution.
 - Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See `modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml>`_ for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See `modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml <https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml>`_ for usage.
 - Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (``modelopt.torch.quantization.src.conv``). When NVFP4 quantization is applied to an ``nn.Conv3d`` layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (``groups > 1``) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
 - Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in :class:`FP8QuantExporter <modelopt.onnx.export.fp8_exporter.FP8QuantExporter>`, per-instance nested-attention-wrapper skipping in the HF plugin, and ``nn.LayerNorm`` registration in ``QuantModuleRegistry`` so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See `examples/torch_onnx/torch_quant_to_onnx.py <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/torch_onnx/torch_quant_to_onnx.py>`_ for the general timm-model quantize→ONNX workflow.
diff --git a/pyproject.toml b/pyproject.toml
index 74927947215..28ca051d9e7 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -210,6 +210,7 @@ extend-ignore = [
 "examples/*" = ["D"]
 "noxfile.py" = ["D", "E501"]
 "tests/*" = ["B017", "D", "E402", "PT012"]
+".agents/skills/*/tests/test_*.py" = ["D", "E402"] # Skill test scripts: docstring (D) + sys.path import-order (E402) exemptions
 "*/_[a-zA-Z]*" = ["D"] # Private packages (_abc/*.py) or modules (_xyz.py)
 "*.ipynb" = [
     "D",