Skip to content

Commit d3064ec

Browse files
committed
Add .claude/skills/test-quality and reference it from AGENTS.md
Checks in the test authoring/review conventions as a project skill so they're available to anyone working in the repo, and adds a pointer at the top of the AGENTS.md Testing section.
1 parent 4caa41f commit d3064ec

2 files changed

Lines changed: 152 additions & 0 deletions

File tree

Lines changed: 150 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,150 @@
1+
---
2+
name: test-quality
3+
description: Test quality bar for this repo. Read when writing, reviewing, or designing tests — covers naming, abstraction level, assertions, determinism, and the process for agent-driven test work.
4+
---
5+
6+
# Test & code quality guide
7+
8+
What "best practice for new work" means in this repo. Each rule carries its recorded reasoning
9+
where one exists; a rule with no stated why is a convention — follow it anyway.
10+
11+
## Naming & shape
12+
13+
- **Test names are behaviour sentences** stating the observable outcome, not the feature being
14+
poked: `test_elicit_form_decline_returns_no_content`, never `test_elicit_form_decline`.
15+
- Plain top-level `test_*` functions; no `Test` classes (legacy files have them — don't copy).
16+
- **Docstrings: 1–2 sentences of behaviour, honest about provenance** — spec-mandated,
17+
SDK-defined, or pinning a known gap? Say which: provenance is the triage key when the test
18+
later fails. A pinned-gap assertion breaking usually means a change *fixed* the gap; a
19+
spec-mandated assertion breaking means a regression.
20+
- Define things in dependency order; nothing forward-references. For client↔server tests:
21+
handlers → server construction → client setup → act → assert — the test reads in the order
22+
the conversation happens.
23+
- Inline the server (or equivalent setup) in the test, so the whole observable behaviour fits
24+
on one screen. Lift to a file-level fixture only when several tests in *that file* genuinely
25+
share it; never share across files.
26+
- A big multi-step test is fine when the property is irreducibly multi-step (e.g. resumability).
27+
Split when a failure wouldn't tell you which claim broke — not for shortness. Compensate with
28+
a numbered "Steps:" docstring so a reader sees the choreography before the body.
29+
30+
## Level of abstraction
31+
32+
- **Drive through the highest-level public API that can observe the property.** Hand-built wire
33+
requests are brittle and don't prove the user-facing contract; tests that stay above the
34+
internals keep working when the internals change. Drop to raw HTTP only when the assertion is
35+
about something the high-level API *cannot* observe (status codes, headers, wire framing).
36+
- **Scripting a peer over raw streams is a last resort**, reserved for behaviour the typed API
37+
cannot *produce* (malformed input, an impossible peer response). First ask what it would take
38+
for the public API to express it — often a small helper suffices. Every such test's docstring
39+
states why the public API couldn't do it.
40+
- **In-memory / in-process first.** HTTP-, SSE-, and auth-shaped behaviour can all be driven
41+
through an in-process ASGI transport; threads only when necessary, subprocesses only when the
42+
process boundary is itself the thing under test. In-process isn't just faster — it surfaces
43+
bugs (a real stream leak was found this way) that subprocess indirection masks.
44+
- **Tests read like real user code**: no aliasing shims in conftest, no walls of suppressions,
45+
no private imports unless that is genuinely the documented way to do the thing.
46+
- The assertion must prove the round trip — no side-channel state. What the server saw comes
47+
back through the protocol, or via a closure-captured list asserted after the call. Handlers
48+
assert their dispatch identity first (`assert params.name == "add"`), proving the request
49+
that arrived is the request the test sent.
50+
51+
## Assertions
52+
53+
- **Transformations** (input → output the SDK produced) → full-object `snapshot(...)` equality,
54+
so an added or dropped field fails. Regenerate with `--inline-snapshot=fix` so intentional
55+
changes arrive as a reviewable diff; never hand-edit snapshot literals.
56+
- **Pass-through values** (opaque tokens, `_meta`, cursors) → identity against the same
57+
variable you sent. A snapshot of a pass-through value only "matches" because a human checked
58+
two literals correspond — it proves nothing.
59+
- **Errors**`pytest.raises` + `.code` against the named constant; snapshot SDK-authored
60+
messages; never `match=` on message text. Third-party text (pydantic, jsonschema) → stable
61+
prefix only — never pin text that changes with a dependency upgrade — with a comment saying so.
62+
63+
## Determinism
64+
65+
- **No sleeps, ever.** A sleep guesses at timing instead of waiting on the condition, so it
66+
either flakes or pads the run — sleeps head the list of the older test code's failure modes.
67+
Coordinate with `anyio.Event` so the wait ends exactly when the condition holds (that
68+
discipline is why 529 e2e tests run in ~10 s). Sole exception: tests *of* time-based
69+
features, with a comment.
70+
- Bound every indefinite wait with `anyio.fail_after(5)`. **5 is the standard** — widening it
71+
needs an articulable reason; "10 to be safe" is covering up a flake, not fixing one, and
72+
unexplained widenings propagate (one agent used 10 and every later one copied it).
73+
- **Never assert wall-clock time, even with huge margins.** `elapsed < 0.9` on a ~0.01 s
74+
operation — a 90× margin — was still rejected in review. `fail_after` bounding a hang is the
75+
only timing primitive allowed.
76+
- **Concurrency tests must prove genuine interleaving.** Without barriers the scheduler is free
77+
to serialize — "a" can start, finish, and return before "b" even begins — so two `start_soon`
78+
calls prove nothing. Gate with events so all parties are mid-flight before any proceeds, emit
79+
interleaved, then assert the demux.
80+
- **Don't over-synchronize either.** When delivery ordering is guaranteed (notifications
81+
emitted during a request you're awaiting, over a single ordered in-memory stream), a plain
82+
collected list asserted after the call is correct; events are for messages not tied to an
83+
awaited operation. Verify the ordering guarantee actually holds for your transport first.
84+
- Async tests use anyio, not asyncio.
85+
86+
## Behaviour philosophy
87+
88+
- **Pin current behaviour; never xfail.** A green suite asserting what actually happens is a
89+
regression bar for any refactor; an xfail proves nothing about it. Where current behaviour
90+
falls short of spec, pin the divergent output and record the gap as data — a tracking issue,
91+
with the docstring naming the known gap (suites with a requirements manifest record it as a
92+
divergence entry). Not hidden, not skipped.
93+
- **Hollow-proof check**: before claiming a test covers a behaviour, re-read the claim and ask
94+
"which assertion proves *this*?" A passing test near a behaviour is not proof of it — a full
95+
review of the e2e suite found two such cases even under this discipline.
96+
97+
## Hygiene
98+
99+
- No new `# pragma`, `# type: ignore`, `# noqa` by default — restructure first; a suppression
100+
usually means a test or a type is missing. The narrow sanctioned escape hatches (and the
101+
audit to run before pushing) are in AGENTS.md. In tests, narrow types with `assert
102+
isinstance`; never `Any`/`object` when a real type exists.
103+
- **Warnings raised during tests are findings, not noise** (the repo runs
104+
`filterwarnings = error`). Fix the cause; if the fix can't land in the same change, scope the
105+
suppression to the one fixture that needs it and track the real fix explicitly.
106+
- Registered-but-never-invoked handler bodies are `raise NotImplementedError`, so they cannot
107+
silently become load-bearing.
108+
- Comments live next to the line they explain, not in docstrings; single backticks for code
109+
refs; match the surrounding comment density (one-liners next to one-liners). No
110+
`from __future__ import annotations` (py310+ repo).
111+
- Test work doesn't change `src/` as a side effect — the one mechanical exception is deleting a
112+
pragma a new test now covers. If a test can't be written without a library change, raise it
113+
as a finding or defer the test; don't quietly edit `src/`.
114+
115+
## Process (for agent-driven work)
116+
117+
- **Small chunks.** ≤10 tests per human-reviewed batch; when fanning out to multiple agents,
118+
≤5 per agent. Review quality scales inversely with batch size: the observed shortcuts
119+
(over-stacked tests, a timing assertion, wrong abstraction level) all surfaced in one
120+
oversized 27-test batch.
121+
- **Design → review → implement, as separate steps.** The design deliverable is not just the
122+
plan — it's the judgement calls (abstraction level, deferrals, contested assertions) stated
123+
explicitly, so the reviewer vetoes them before implementation rather than discovering them
124+
in the diff.
125+
- **High-stakes areas get fresh adversarial reviewers** on both the design and the
126+
implementation — fresh, because whoever designed it (or saw the proposed fix) anchors on the
127+
same layer. SERIOUS findings (wrong assertion, missed MUST/SHOULD, an unrecorded known gap)
128+
re-loop; style doesn't. Reserve the full panel for areas where being wrong is worse than
129+
being slow.
130+
- **Investigations pair a full-context look with a fresh agent given only the problem**
131+
never the proposed fix — and compare conclusions. The unbiased read regularly catches
132+
anchoring on the wrong layer.
133+
- **When review questions a decision, reconsider genuinely**: re-derive the tradeoff, state the
134+
options, recommend with reasons. Defending the original choice is a valid outcome;
135+
reflexively agreeing with every challenge is as bad as ignoring it.
136+
- **Validate the reference artifact before building on it.** Whatever you treat as ground truth
137+
(a spec import, a baseline, a generated list), check it *first* — discovering it was
138+
incomplete after five batches costs far more than before batch one.
139+
- **Verify as you go**: per-file `uv run --frozen pytest <file> -q` + pyright + ruff while
140+
iterating; full suite, coverage, and `./scripts/test` (when `src/` was touched) at
141+
integration.
142+
- **Every agent writes notes**: what it did, what broke, and — most importantly — what it
143+
couldn't decide, so open questions surface instead of being silently resolved by whoever hit
144+
them.
145+
- **Quality over speed is a stated goal, not a preference** ("rushing something out the door is
146+
not the goal here, explicitly so"). Don't quietly de-scope agreed work mid-stream;
147+
renegotiate scope explicitly.
148+
- **Don't copy patterns from existing test code by default** — much of the repo's older test
149+
code is below the current bar (sleeps, mocks, `Test` classes, raciness). These rules define
150+
the bar for new work.

AGENTS.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,8 @@
4545

4646
## Testing
4747

48+
- When writing or reviewing tests, conform to `.claude/skills/test-quality/SKILL.md`
49+
— it defines the bar for naming, abstraction level, assertions, and determinism.
4850
- Framework: `uv run --frozen pytest`
4951
- Async testing: use anyio, not asyncio
5052
- Do not use `Test` prefixed classes — write plain top-level `test_*` functions.

0 commit comments

Comments
 (0)