177 changes: 177 additions & 0 deletions .agents/skills/day0-release/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,177 @@
---
name: day0-release
description: Deterministic end-to-end driver for day-0 quantized-checkpoint releases — chains PTQ → evaluation → comparison with enforced gates between stages (the evaluation stage deploys the checkpoint itself), and returns a publish decision (ACCEPT / REGRESSION / ANOMALOUS / INFEASIBLE). Use when the user asks to "release a model at day-0", "quantize and validate model X is within N% of baseline and tell me if it's publishable", or "run the full day-0 workflow". Do NOT use for single-stage requests — quantizing only (use ptq), serving only (use deployment), evaluating only (use evaluation), or comparing two existing runs (use compare-results).
license: Apache-2.0
---

# Day-0 Release

Drive a model from a pretrained checkpoint to a publish decision for a quantized
checkpoint, in a fixed sequence with a gate after every stage. This skill is a
**conductor**: it sequences the existing domain skills and enforces the gates —
it does not re-implement quantization, serving, evaluation, or comparison.

**Goal (the default day-0 criterion):** a quantized checkpoint smaller than the
source, with accuracy drop within the threshold (default ≤1%, i.e. `0.01`) on
the standard benchmark set versus the matching baseline, plus a publish
recommendation.

## When to use

Use only for the full goal-driven release. For a single stage, route to the
domain skill directly: quantize → **ptq**, serve → **deployment**, evaluate →
**evaluation**, compare two existing runs → **compare-results**.

## Inputs

Resolve these before starting (ask the user for anything missing):

- **Model** — HF handle or checkpoint path.
- **Recipe / qformat** — e.g. `nvfp4`, `fp8`, or a recipe path. One candidate for v1.
- **Cluster / launcher** — from `clusters.yaml` (see `skills/common/environment-setup.md`).
- **Eval set** — defaults to the AA suite (`evaluation/recipes/tasks/aa/`).
- **Threshold** — max accuracy drop; default `0.01` (1%).

## The chain

```text
setup ─▶ PTQ ─▶ baseline-eval ─▶ quantized-eval ─▶ compare ─▶ closeout
          │           │                │              │
      gate_ptq    gate_run         gate_run     gate_compare
```

The **evaluation** skill deploys the model it evaluates (it stands up its own
endpoint per run), so there is no separate deploy stage — a serving failure
surfaces through the eval stage's gate (`DEPLOYMENT_HEALTH_FAILED`) and triages
to the **deployment** skill to debug serving in isolation (see Step 4).

Run each stage by invoking the domain skill, then run its gate before
proceeding. **Do not advance past a failed gate.** Copy this checklist and track
progress:

```text
- [ ] Step 0: Resolve inputs; confirm threshold and eval set
- [ ] Step 1: Setup gate — creds present, cluster reachable
- [ ] Step 2: PTQ (ptq skill) → gate_ptq.py
- [ ] Step 3: Baseline eval (evaluation skill, deploys source) → gate_run.py [skip if cached, see below]
- [ ] Step 4: Quantized eval (evaluation skill, deploys candidate) → gate_run.py
- [ ] Step 5: Compare (compare-results skill) → gate_compare.py → decision
- [ ] Step 6: Closeout — report + publish recommendation
```

### Step 1 — Setup gate

Confirm credentials (`skills/common/credentials.md`) and cluster reachability
(`skills/common/remote-execution.md`). If either fails, stop with
`SYSTEMIC` — do not start PTQ.

### Step 2 — PTQ

Invoke the **ptq** skill to produce the quantized checkpoint. Then gate:

```bash
# The ptq skill's post-PTQ validation produces a validation-summary JSON (size
# ratio + layer-precision counts + metadata diffs; see
# ptq/references/checkpoint-validation.md). v1 gates on that summary:
python .agents/skills/day0-release/scripts/gate_ptq.py --summary <validation-summary.json>
# add `--recipe <qformat>` to override the recipe recorded in the summary
```

`gate_ptq.py` returns JSON `{pass, failure_class, detail}`. On `pass: false`,
branch on `failure_class` (see **Triage** below). Do not evaluate an
unvalidated checkpoint.
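
As a minimal sketch of how a driver might consume a gate verdict, assuming only the `{pass, failure_class, detail}` shape described above: the helper names `parse_gate_verdict` and `may_advance` are hypothetical, and the sample verdict text is illustrative rather than real `gate_ptq.py` output.

```python
import json

REQUIRED_KEYS = {"pass", "failure_class", "detail"}


def parse_gate_verdict(stdout: str) -> dict:
    """Parse a gate's JSON verdict and check that the documented keys exist."""
    verdict = json.loads(stdout)
    if not isinstance(verdict, dict):
        raise ValueError("gate verdict must be a JSON object")
    missing = REQUIRED_KEYS - set(verdict)
    if missing:
        raise ValueError(f"gate verdict missing keys: {sorted(missing)}")
    return verdict


def may_advance(verdict: dict) -> bool:
    """Only an explicit pass == True lets the chain move to the next stage."""
    return verdict["pass"] is True


# Illustrative verdict (made up for this sketch, not captured from a real run).
sample = (
    '{"pass": false, "failure_class": "QUANT_COVERAGE_FAILURE",'
    ' "detail": "illustrative detail"}'
)
verdict = parse_gate_verdict(sample)
if not may_advance(verdict):
    # Stop here and hand failure_class to triage instead of evaluating.
    print(f"stop: {verdict['failure_class']}: {verdict['detail']}")
```

Checking `pass is True` rather than truthiness is a deliberate choice in this sketch: a malformed verdict such as `"pass": "false"` should not let the chain advance.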

### Step 3 — Baseline eval

The baseline is the **source** (pre-quantization) model on the same task set and
sampling params. **Look it up first** — if a matching baseline run already
exists in MLflow (same model, task set, sampling params), reuse it and skip this
stage. Otherwise run it via the **evaluation** skill (which deploys the source
model itself). Gate with `gate_run.py`.

### Step 4 — Quantized eval

Invoke the **evaluation** skill on the quantized checkpoint, matching the
baseline's task set and sampling params. The evaluation skill stands up the
serving endpoint itself (it builds the `deployment.command`, e.g. a
`vllm serve …`), so a serving failure surfaces here as a failed `gate_run.py`
with `DEPLOYMENT_HEALTH_FAILED`. When that happens, **drop to the deployment
skill** to reproduce and debug serving in isolation (serve the checkpoint
standalone, confirm `/health` + one generation, iterate on flags / TP / image /
env vars) rather than burning full eval cycles on a broken endpoint — then carry
the working command back into NEL's `deployment.command` and resume the eval. If
the checkpoint genuinely can't serve, `POINT_INFEASIBLE`. Gate:

```bash
python .agents/skills/day0-release/scripts/gate_run.py --run <run-summary.json>
```

A `pass: false` here means the run is incomplete or invalid (judge/parse error,
dropped samples) — do **not** compare scores from it.

### Step 5 — Compare

Invoke the **compare-results** skill to produce per-task deltas, then gate:

```bash
python .agents/skills/day0-release/scripts/gate_compare.py \
--baseline <baseline_scores.json> --candidate <candidate_scores.json> \
--threshold 0.01
```

The threshold is a fraction of each task's score scale. Most AA tasks report
0-100, but some (e.g. `tau2_bench_telecom` `Result`) report 0-1; the gate infers
each task's scale (0-1 if both scores are within [0, 1], else 0-100) and
normalizes the drop accordingly, so `--threshold 0.01` means "≤1 pt on a 0-100
task / ≤0.01 on a 0-1 task" uniformly. Pass `--scales '{"task": max}'` to
override inference if a task's scores happen to fall in an ambiguous range.

Decision from `gate_compare.py`:

- **ACCEPT** — every task within threshold → go to Step 6.
- **REGRESSION** — one or more tasks exceed threshold. **v1 stops here and
reports** which tasks regressed by how much. (Picking the next recipe and
re-running is deferred — see Scope.)
- **ANOMALOUS** — scores present but implausible (e.g. baseline lower than
candidate by a large margin, or a task score outside its valid range) →
surface to the user.

### Step 6 — Closeout

Report the decision with: source vs output size + ratio, per-task baseline /
candidate / delta / within-threshold, MLflow run IDs, and a publish
recommendation (publish / do-not-publish / needs-human). Archive artifacts to
the workspace.

## Triage (gate failure → decision)

Map a gate's `failure_class` to the next action:

| `failure_class` | Action |
| --- | --- |
| `INFRA_TRANSIENT` | Retry the stage once; if it recurs, `SYSTEMIC`. |
| `MODEL_UNSUPPORTED` | PATCH: fix the recipe pattern / add model support (ptq skill owns the patch loop), then retry. If unpatchable, `POINT_INFEASIBLE`. |
| `QUANT_COVERAGE_FAILURE` | PATCH: fix the recipe wildcard so intended layers are covered; re-run PTQ. |
| `DEPLOYMENT_HEALTH_FAILED` | Drop to the **deployment** skill: reproduce serving standalone (`/health` + one generation), debug flags / image / TP / env, then carry the working command into NEL's `deployment.command` and retry the eval. If it can't serve, `POINT_INFEASIBLE`. |
| `EVAL_JUDGE_FAILED` | Usually transient (auth / rate limit) — wait and retry. |
| `SAMPLE_ACCOUNTING_FAILED` | Investigate dropped/failed samples before trusting scores. |
| `USER_CONFIG_ERROR` | Stop and ask the user. |
| `UNKNOWN` | Stop and surface to the user (`NEEDS_HUMAN`). |

`SYSTEMIC` (cluster down, dataset unavailable) aborts the whole run.
`POINT_INFEASIBLE` means this (model, recipe) can't work as configured.
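
One way to keep the triage table deterministic is to encode it as a lookup. The sketch below is an assumption about how a driver could do that, not code from this PR: the action labels are hypothetical shorthand for the "Action" column, and treating an unlisted `failure_class` like `UNKNOWN` is a choice made for this example.

```python
# Hypothetical action labels standing in for the table's "Action" column.
TRIAGE = {
    "INFRA_TRANSIENT": "RETRY_ONCE",
    "MODEL_UNSUPPORTED": "PATCH_RECIPE_THEN_RETRY",
    "QUANT_COVERAGE_FAILURE": "PATCH_RECIPE_THEN_RERUN_PTQ",
    "DEPLOYMENT_HEALTH_FAILED": "DEBUG_WITH_DEPLOYMENT_SKILL",
    "EVAL_JUDGE_FAILED": "WAIT_AND_RETRY",
    "SAMPLE_ACCOUNTING_FAILED": "INVESTIGATE_SAMPLES",
    "USER_CONFIG_ERROR": "ASK_USER",
    "UNKNOWN": "NEEDS_HUMAN",
}


def next_action(failure_class: str, already_retried: bool = False) -> str:
    """Map a gate's failure_class to the next action from the triage table."""
    # A transient infra failure that recurs after one retry escalates to SYSTEMIC.
    if failure_class == "INFRA_TRANSIENT" and already_retried:
        return "SYSTEMIC"
    # Assumption: anything not in the table is surfaced to a human, like UNKNOWN.
    return TRIAGE.get(failure_class, "NEEDS_HUMAN")


print(next_action("INFRA_TRANSIENT"))
print(next_action("INFRA_TRANSIENT", already_retried=True))
print(next_action("SOMETHING_NEW"))
```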

## Output

Return a decision, not a raw artifact:

- `ACCEPT` + report + publish recommendation
- `REGRESSION` + which tasks failed the threshold and by how much
- `ANOMALOUS` / `INFEASIBLE` / `NEEDS_HUMAN` + reason
- Always: workspace path + MLflow run IDs for traceability

## Scope (v1)

In v1: the linear chain + gates + report. On `REGRESSION`, v1 reports and stops.
Deferred to a follow-up: the evaluator-optimizer recipe loop (compare → pick the
next recipe → re-run PTQ), which needs the bigpareto integration and a shared
config/result schema.
208 changes: 208 additions & 0 deletions .agents/skills/day0-release/scripts/gate_compare.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,208 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Day-0 compare gate.

Decides whether a quantized candidate is within the accuracy threshold of its
baseline, per task. Pure decision logic in ``evaluate_comparison`` (unit-tested
without GPU/cluster); ``main`` reads score JSON files and prints the verdict.

Score files are ``{task_name: score}`` dicts. Most AA task references report
``*_avg_of_N`` on a 0-100 scale, but some tasks (e.g. ``tau2_bench_telecom``
``Result``) report on a 0-1 scale. The gate is therefore scale-aware: each
task's scale is inferred per task (0-1 if both scores are within [0, 1], else
0-100) or supplied explicitly via ``--scales``, and the drop is normalized to a
fraction of that scale so the threshold applies uniformly. The drop is an
absolute (scale-normalized) delta unless ``--relative`` is passed.
"""

from __future__ import annotations

import argparse
import json
import math
import sys


def _is_valid_score(val):
    """True only for a finite real number in [_SCORE_MIN, _SCORE_MAX] (not bool)."""
    return (
        isinstance(val, (int, float))
        and not isinstance(val, bool)
        and math.isfinite(val)
        and _SCORE_MIN <= val <= _SCORE_MAX
    )


# Decisions
ACCEPT = "ACCEPT"
REGRESSION = "REGRESSION"
ANOMALOUS = "ANOMALOUS"

# Plausibility bounds. Scores may be on a 0-1 or 0-100 scale (see _infer_scale);
# the upper bound is the larger of the two so both are accepted.
_SCORE_MIN = 0.0
_SCORE_MAX = 100.0
# A candidate scoring this fraction of its scale ABOVE baseline is implausible
# for quantization (quantization should not meaningfully improve accuracy); flag
# it rather than silently passing. 0.05 = 5 pts on a 0-100 task, 0.05 on a 0-1 task.
_IMPLAUSIBLE_GAIN_FRAC = 0.05


def _infer_scale(*vals):
    """Infer a task's score scale: 1.0 if every score is within [0, 1], else 100.0.

    Most AA tasks report 0-100; a few (e.g. ``tau2_bench_telecom``) report 0-1.
    Without scale metadata in the score files, we treat a task as 0-1 only when
    every score for it fits in [0, 1] — a 0-100 task with sub-1.0 accuracy is
    degenerate and caught elsewhere. Pass an explicit scale to override.
    """
    return 1.0 if all(0.0 <= v <= 1.0 for v in vals) else 100.0


def evaluate_comparison(baseline, candidate, threshold=0.01, relative=False, scales=None):
    """Compare candidate vs baseline scores per task.

    Args:
        baseline: dict ``{task: score}``.
        candidate: dict ``{task: score}``.
        threshold: max allowed drop, as a fraction of the task's scale
            (0.01 = 1 percentage point on a 0-100 task / 0.01 on a 0-1 task,
            or 1% relative if ``relative``).
        relative: if True, drop is measured relative to the baseline score
            (scale-invariant).
        scales: optional dict ``{task: max_scale}`` to override per-task scale
            inference (e.g. ``{"tau2_bench_telecom": 1.0}``).

    Returns:
        dict ``{pass, decision, failure_class, detail, per_task}``.
    """
    scales = scales or {}
    missing = sorted((set(baseline) | set(candidate)) - (set(baseline) & set(candidate)))
Comment on lines +92 to +93
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate scales schema and numeric bounds before use.

--scales is parsed with json.loads, but non-object or invalid values (e.g., [], {"t": "100"}, {"t": 0}) can still reach arithmetic paths and crash (.get/division), which breaks deterministic JSON gate output. Treat these as USER_CONFIG_ERROR instead of throwing.

Suggested fix
 def evaluate_comparison(baseline, candidate, threshold=0.01, relative=False, scales=None):
@@
-    scales = scales or {}
+    if scales is None:
+        scales = {}
+    elif not isinstance(scales, dict):
+        return {
+            "pass": False,
+            "decision": ANOMALOUS,
+            "failure_class": "USER_CONFIG_ERROR",
+            "detail": "scales must be a JSON object: {task: max_scale}",
+            "per_task": {},
+        }
+    else:
+        normalized = {}
+        for task, scale in scales.items():
+            if (
+                not isinstance(scale, (int, float))
+                or isinstance(scale, bool)
+                or not math.isfinite(scale)
+                or scale <= 0
+            ):
+                return {
+                    "pass": False,
+                    "decision": ANOMALOUS,
+                    "failure_class": "USER_CONFIG_ERROR",
+                    "detail": f"invalid scale for task {task!r}: {scale!r}",
+                    "per_task": {},
+                }
+            normalized[task] = float(scale)
+        scales = normalized

Also applies to: 131-131, 197-197

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/day0-release/scripts/gate_compare.py around lines 92 - 93,
The code accepts JSON for the variable scales but does not validate that it is a
mapping of string keys to positive numeric values, causing crashes when
non-object or invalid values reach arithmetic paths; update the handling of
scales (the scales variable and any usages around the existing comparison logic)
to first verify isinstance(scales, dict), then for each key/value ensure keys
are strings and values are ints or floats and > 0 (reject zero to avoid
division-by-zero), and if validation fails raise or return a USER_CONFIG_ERROR
with a clear message; apply the same validation/guarding to the other scales
usage sites mentioned (the other two occurrences) and replace direct
.get/division uses with the validated values.
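
Since the finding applies to several sites, the validation could also be factored into one helper that each site calls. The following is a sketch of that idea under the rules stated in the finding (object only, positive finite non-bool numbers); `validate_scales` is a hypothetical name and the `(scales, error)` return shape is an assumption, not part of the PR.

```python
import math


def validate_scales(scales):
    """Return (normalized_scales, None) on success or (None, error_detail) on failure."""
    if scales is None:
        return {}, None
    if not isinstance(scales, dict):
        return None, "scales must be a JSON object: {task: max_scale}"
    normalized = {}
    for task, scale in scales.items():
        if (
            not isinstance(scale, (int, float))
            or isinstance(scale, bool)  # bool is an int subclass; reject it explicitly
            or not math.isfinite(scale)
            or scale <= 0  # zero or negative would break the scale division
        ):
            return None, f"invalid scale for task {task!r}: {scale!r}"
        normalized[task] = float(scale)
    return normalized, None


# The inputs named in the finding, plus one valid override.
for raw in (None, [], {"t": "100"}, {"t": 0}, {"t": 100}):
    print(raw, "->", validate_scales(raw))
```

A caller would turn a non-`None` error into the `USER_CONFIG_ERROR` verdict, so the gate still prints JSON instead of raising.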

    if missing:
        return {
            "pass": False,
            "decision": ANOMALOUS,
            "failure_class": "SAMPLE_ACCOUNTING_FAILED",
            "detail": f"task sets differ; missing on one side: {missing}",
            "per_task": {},
        }
    if not baseline:
        return {
            "pass": False,
            "decision": ANOMALOUS,
            "failure_class": "USER_CONFIG_ERROR",
            "detail": "no tasks to compare",
            "per_task": {},
        }

    per_task = {}
    regressed = []
    anomalies = []
    for task in sorted(baseline):
        b, c = baseline[task], candidate[task]
        invalid = False
        for label, val in (("baseline", b), ("candidate", c)):
            if not _is_valid_score(val):
                anomalies.append(f"{task}: {label} score {val!r} not a finite number in [0, 100]")
                invalid = True
        if invalid:
            # Don't compute deltas on non-numeric/out-of-range scores (would raise
            # TypeError); record the anomaly and move on — the run is ANOMALOUS.
            per_task[task] = {
                "baseline": b,
                "candidate": c,
                "drop": None,
                "within_threshold": False,
            }
            continue
        scale = scales.get(task) or _infer_scale(b, c)
        delta = b - c  # native units, for reporting
        if relative:
            drop = delta / b if b else 0.0  # fraction of baseline (scale-invariant)
        else:
            drop = delta / scale  # fraction of the task's scale
        within = drop <= threshold
        gain = (c - b) / scale
        if gain > _IMPLAUSIBLE_GAIN_FRAC:
            anomalies.append(
                f"{task}: candidate exceeds baseline by {c - b:.4g} ({gain:.1%} of scale, implausible)"
            )
        per_task[task] = {
            "baseline": b,
            "candidate": c,
            "drop": round(delta, 4),
            "drop_fraction": round(drop, 4),
            "scale": scale,
            "within_threshold": within,
        }
        if not within:
            regressed.append(task)

    if anomalies:
        return {
            "pass": False,
            "decision": ANOMALOUS,
            "failure_class": "UNKNOWN",
            "detail": "; ".join(anomalies),
            "per_task": per_task,
        }
    if regressed:
        return {
            "pass": False,
            "decision": REGRESSION,
            "failure_class": None,
            "detail": f"tasks exceeding threshold ({threshold}): {regressed}",
            "per_task": per_task,
        }
    return {
        "pass": True,
        "decision": ACCEPT,
        "failure_class": None,
        "detail": f"all {len(per_task)} task(s) within threshold {threshold}",
        "per_task": per_task,
    }


def main(argv=None):
    """CLI entry point: read baseline/candidate score JSON and print the verdict."""
    p = argparse.ArgumentParser(description="Day-0 compare gate")
    p.add_argument("--baseline", required=True, help="baseline score JSON {task: score}")
    p.add_argument("--candidate", required=True, help="candidate score JSON {task: score}")
    p.add_argument("--threshold", type=float, default=0.01, help="max drop fraction (default 0.01)")
    p.add_argument("--relative", action="store_true", help="measure drop relative to baseline")
    p.add_argument(
        "--scales",
        help="optional JSON {task: max_scale} to override per-task scale inference",
    )
    args = p.parse_args(argv)

    try:
        with open(args.baseline) as f:
            baseline = json.load(f)
        with open(args.candidate) as f:
            candidate = json.load(f)
        scales = json.loads(args.scales) if args.scales else None
    except (OSError, json.JSONDecodeError) as e:
        print(json.dumps({"pass": False, "failure_class": "USER_CONFIG_ERROR", "detail": str(e)}))
        return 2

    result = evaluate_comparison(baseline, candidate, args.threshold, args.relative, scales)
    print(json.dumps(result, indent=2))
    return 0 if result["pass"] else 1


if __name__ == "__main__":
    sys.exit(main())