From 502e945ba30f404d6a6ae68405f476d50b39ccf5 Mon Sep 17 00:00:00 2001
From: yiyixuxu <yixu310@gmail.com>
Date: Thu, 11 Jun 2026 00:37:38 +0000
Subject: [PATCH 1/9] [.ai] add self-review skill, retire parity-testing skill,
 and tighten the agent guides

- New `self-review` skill mirroring the `@claude` CI review (rubric from
  review-rules.md, call-path dead-code analysis), report-only, with the report
  flagging what to fix before submitting (blocking + dead code) vs what to leave
  for the actual review.
- Remove the WIP `parity-testing` skill; preserve its pitfalls as
  `model-integration/pitfalls.md` (numerical-discrepancy reference).
- model-integration: restructure around a grouped checklist, default-to-modular,
  an overall file-structure sketch (details deferred to the guides), a
  fresh-conversion `Model parity test` example (internal, not shipped), and a
  filled-in weight/checkpoint-conversion section.
- Centralize the loading rule (from_pretrained / from_single_file, no custom
  loaders) in models.md; add per-folder File structure sections to models.md /
  pipelines.md; default-to-modular note in pipelines.md.
- AGENTS.md: dedicated 'Self-review before a PR' and 'Reference guides' sections.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .ai/AGENTS.md                                 |  25 +--
 .ai/models.md                                 |   1 +
 .ai/pipelines.md                              |  12 ++
 .ai/review-rules.md                           |   3 +-
 .ai/skills/model-integration/SKILL.md         | 137 ++++++++------
 .../pitfalls.md                               |   6 +-
 .ai/skills/parity-testing/SKILL.md            | 172 ------------------
 .../parity-testing/checkpoint-mechanism.md    | 103 -----------
 .ai/skills/self-review/SKILL.md               |  48 +++++
 9 files changed, 157 insertions(+), 350 deletions(-)
 rename .ai/skills/{parity-testing => model-integration}/pitfalls.md (96%)
 delete mode 100644 .ai/skills/parity-testing/SKILL.md
 delete mode 100644 .ai/skills/parity-testing/checkpoint-mechanism.md
 create mode 100644 .ai/skills/self-review/SKILL.md

diff --git a/.ai/AGENTS.md b/.ai/AGENTS.md
index 9a6ef6b30f52..1343dd051ba3 100644
--- a/.ai/AGENTS.md
+++ b/.ai/AGENTS.md
@@ -8,13 +8,11 @@ Strive to write code as simple and explicit as possible.
 - No defensive code, unused code paths, or legacy stubs — do not add fallback paths, safety checks, or configuration options "just in case"; do not carry unused method parameters "for API consistency", backwards-compatibility aliases for names that never shipped, or deprecation shims for code that was never released. When porting from a research repo, delete training-time code paths, experimental flags, and ablation branches entirely — only keep the inference path you are actually integrating.
 - Do not guess user intent and silently correct behavior. Make the expected inputs clear in the docstring, and raise a concise error for unsupported cases rather than adding complex fallback logic.
 
-Before opening the PR, self-review against [review-rules.md](review-rules.md), which collects the most common mistakes we catch in review.
-
 ---
 
 ## Code formatting
 
-- `make style` and `make fix-copies` should be run as the final step before opening a PR
+- `make style` and `make fix-copies` should be run before opening a PR
 
 ### Copied Code
 
@@ -22,22 +20,19 @@ Before opening the PR, self-review against [review-rules.md](review-rules.md), w
 - Do not edit a `# Copied from` block directly — run `make fix-copies` to propagate changes from the source
 - Remove the header to intentionally break the link
 
-### Models
-
-- See [models.md](models.md) for model conventions, attention pattern, implementation rules, dependencies, and gotchas.
-- See the [model-integration](./skills/model-integration/SKILL.md) skill for the full integration workflow, file structure, test setup, and other details.
-
-### Pipelines & Schedulers
-
-- See [pipelines.md](pipelines.md) for pipeline conventions, patterns, and gotchas.
+## Reference guides
 
-### Modular Pipelines
-
-- See [modular.md](modular.md) for modular pipeline conventions, patterns, and gotchas.
+- **Models** — see [models.md](models.md) for model conventions, attention pattern, implementation rules, dependencies, and gotchas. For adding or converting a model, use the [model-integration](./skills/model-integration/SKILL.md) skill.
+- **Pipelines** — see [pipelines.md](pipelines.md) for pipeline conventions, patterns, and gotchas.
+- **Modular pipelines** — see [modular.md](modular.md) for modular pipeline conventions, patterns, and gotchas.
 
 ## Skills
 
 Task-specific guides live in `.ai/skills/` and are loaded on demand by AI agents. Available skills include:
 
 - [model-integration](./skills/model-integration/SKILL.md) (adding/converting pipelines)
-- [parity-testing](./skills/parity-testing/SKILL.md) (debugging numerical parity).
+- [self-review](./skills/self-review/SKILL.md) (pre-PR self-review against the project rules)
+
+## Self-review before a PR
+
+Before opening a PR, run self-review against [review-rules.md](review-rules.md). The [self-review skill](skills/self-review/SKILL.md) runs this as the same pass the `@claude` CI reviewer uses.
diff --git a/.ai/models.md b/.ai/models.md
index 6e37e742ae57..40df77a728a9 100644
--- a/.ai/models.md
+++ b/.ai/models.md
@@ -13,6 +13,7 @@ Linked from `AGENTS.md`, `skills/model-integration/SKILL.md`, and `review-rules.
 
 * Models use `ModelMixin` with `register_to_config` for config serialization. 
 * When adding a new transformer (or reviewing one), skim `src/diffusers/models/transformers/transformer_flux.py`, `src/diffusers/models/transformers/transformer_flux2.py`, `src/diffusers/models/transformers/transformer_qwenimage.py`, and `src/diffusers/models/transformers/transformer_wan.py` first to establish the pattern. Most conventions (mixin set, file structure, naming, gradient-checkpointing implementation, `_no_split_modules` settings, etc.) are easiest to internalize by comparison rather than from a fixed list.
+* **Loading goes through `from_pretrained` / `from_single_file`.** Weights and configs load through the standard paths — never fetched or imported out-of-band at runtime. Don't override or add a custom `from_pretrained`, and don't load weights manually (`load_file(...)`, `hf_hub_download(...)`, or `sys.path.insert(...)` to import a reference repo). For an original-format single checkpoint, add `from_single_file` support (mixin + weight-mapping).
 
 ## Attention pattern
 
diff --git a/.ai/pipelines.md b/.ai/pipelines.md
index e6db54a7f7de..f25df556c44a 100644
--- a/.ai/pipelines.md
+++ b/.ai/pipelines.md
@@ -3,10 +3,22 @@
 Shared reference for pipeline-related conventions, patterns, and gotchas.
 Linked from `AGENTS.md`, `skills/model-integration/SKILL.md`, and `review-rules.md`.
 
+> **Prefer modular for new pipelines.** [Modular Diffusers](modular.md) is the preferred way to add a new pipeline; the standard `DiffusionPipeline` covered below is still supported but is no longer the default. We prefer modular especially for models that don't fit a fixed task-based structure (e.g. modality baked into the checkpoint) or that are actively evolving. The conventions below apply when you do build or review a standard pipeline.
+
 ## Common pipeline conventions
 
 When adding a new pipeline (or reviewing one), skim `pipeline_flux.py`, `pipeline_flux2.py`, `pipeline_qwenimage.py`, `pipeline_wan.py` first to establish the pattern. Most conventions (class structure, mixin set, `__call__` shape — input validation → encode prompt → timesteps → latent prep → denoise loop → decode — `encode_prompt` / `prepare_latents` shape, `output_type` / `generator` / `progress_bar` plumbing, `@torch.no_grad()` on `__call__`, LoRA mixin, `from_single_file` support, etc.) are easiest to internalize by comparison rather than from a fixed list.
 
+## File structure
+
+```
+src/diffusers/pipelines/<model>/
+  __init__.py                          # Lazy imports
+  pipeline_<model>.py                  # Main pipeline (with __call__)
+  pipeline_<model>_<variant>.py        # Variant pipelines (e.g. img2img, inpaint) — one file/class each
+  pipeline_output.py                   # Output dataclass
+```
+
 ## Gotchas
 
 1. **Config-derived static values: prefer `__init__` attributes.** Values that come from a sub-component's config (e.g. `vae_scale_factor`) belong as `self.foo = ...` in `__init__` — not `@property`, not module-level constants. Note the `getattr(...)` fallback — sub-components may not be loaded when the pipeline is constructed (e.g. via `from_pretrained` on a partial config), so don't assume `self.vae` / `self.transformer` exists.
diff --git a/.ai/review-rules.md b/.ai/review-rules.md
index 75b7cbc8be22..f2d5e9f0b4e6 100644
--- a/.ai/review-rules.md
+++ b/.ai/review-rules.md
@@ -7,8 +7,7 @@ Before reviewing, read and apply the guidelines in:
 - [models.md](models.md) — model conventions, attention pattern, implementation rules, dependencies, gotchas
 - [pipelines.md](pipelines.md) — pipeline conventions, coding style, gotchas
 - [modular.md](modular.md) — modular pipeline conventions, patterns, common mistakes
-- [skills/parity-testing/SKILL.md](skills/parity-testing/SKILL.md) — testing rules, comparison utilities
-- [skills/parity-testing/pitfalls.md](skills/parity-testing/pitfalls.md) — known pitfalls (dtype mismatches, config assumptions, etc.)
+- [skills/model-integration/pitfalls.md](skills/model-integration/pitfalls.md) — known pitfalls causing numerical discrepancies between the reference implementation and the diffusers port (dtype mismatches, config assumptions, etc.)
 
 ## Common mistakes
 
diff --git a/.ai/skills/model-integration/SKILL.md b/.ai/skills/model-integration/SKILL.md
index 7c3cf9fd5e37..32f0a30fa0c2 100644
--- a/.ai/skills/model-integration/SKILL.md
+++ b/.ai/skills/model-integration/SKILL.md
@@ -8,15 +8,7 @@ description: >
 
 ## Goal
 
-Integrate a new model into diffusers end-to-end. The overall flow:
-
-1. **Gather info** — ask the user for the reference repo, setup guide, a runnable inference script, and other objectives such as standard vs modular.
-2. **Confirm the plan** — once you have everything, tell the user exactly what you'll do: e.g. "I'll integrate model X with pipeline Y into diffusers based on your script. I'll run parity tests (model-level and pipeline-level) using the `parity-testing` skill to verify numerical correctness against the reference."
-3. **Implement** — write the diffusers code (model, pipeline, scheduler if needed), convert weights, register in `__init__.py`.
-4. **Parity test** — use the `parity-testing` skill to verify component and e2e parity against the reference implementation.
-5. **Deliver a unit test** — provide a self-contained test script that runs the diffusers implementation, checks numerical output (np allclose), and saves an image/video for visual verification. This is what the user runs to confirm everything works.
-
-Work one workflow at a time — get it to full parity before moving on.
+Integrate a new model into diffusers end-to-end, to full numerical parity with the reference implementation — one workflow at a time.
 
 ## Setup — gather before starting
 
@@ -24,56 +16,74 @@ Before writing any code, gather info in this order:
 
 1. **Reference repo** — ask for the github link. If they've already set it up locally, ask for the path. Otherwise, ask what setup steps are needed (install deps, download checkpoints, set env vars, etc.) and run through them before proceeding.
 2. **Inference script** — ask for a runnable end-to-end script for a basic workflow first (e.g. T2V). Then ask what other workflows they want to support (I2V, V2V, etc.) and agree on the full implementation order together.
-3. **Standard vs modular** — standard pipelines, modular, or both?
+3. **Standard vs modular** — **default to modular.** [Modular Diffusers](../../modular.md) is the preferred implementation for new pipelines; the standard `DiffusionPipeline` is still supported but no longer the default. We prefer modular especially for models that don't fit a fixed task-based structure (modality baked into the checkpoint) or that are actively evolving.
+
+Ask step 3 as an `AskUserQuestion`, with modular marked as the recommended default.
+
+Once you have everything, **confirm the plan** with the user before implementing — state exactly what you'll do, e.g. "I'll integrate model X with pipeline Y based on your script, and verify the model matches the reference before considering it done."
 
-Use `AskUserQuestion` with structured choices for step 3 when the options are known.
+Then work through the **Integration checklist** below — it covers both standard and modular; only the pipeline step differs.
 
-## Standard Pipeline Integration
+## Integration checklist
 
-### File structure for a new model
+- [ ] **Transformer model**
+  - [ ] Implement the model with `from_pretrained` support (conventions: [models.md](../../models.md))
+  - [ ] Convert weights (see **Weight / Checkpoint Conversion**)
+  - [ ] Parity test against the reference (internal, not shipped — see **Model parity test**)
+  - [ ] Register in the relevant `__init__.py` files (lazy imports)
+  - [ ] Model-level tests (see **Testing**)
+- [ ] **VAE** (if applicable) — reuse an existing `AutoencoderKL*` if possible; if a new one is needed, follow the same sub-steps as the transformer
+- [ ] **Scheduler** — reuse an existing scheduler, or add a custom one
+- [ ] **Pipeline**
+  - [ ] Implement the pipeline — see [modular.md](../../modular.md) for modular pipeline, or [pipelines.md](../../pipelines.md) for standard pipeline
+  - [ ] Add a LoRA mixin if applicable
+  - [ ] Register in the relevant `__init__.py` files (lazy imports)
+  - [ ] Pipeline-level tests (see **Testing**)
+- [ ] **Docs** — see **File structure**
+- [ ] **Style** — `make style` and `make quality`
+
+## File structure
+
+A new model PR roughly lands these files (the contents of `pipelines/<model>/` and `modular_pipelines/<model>/` live in their guides):
 
 ```
 src/diffusers/
-  models/transformers/transformer_<model>.py     # The core model
-  schedulers/scheduling_<model>.py               # If model needs a custom scheduler
-  pipelines/<model>/
-    __init__.py
-    pipeline_<model>.py                          # Main pipeline
-    pipeline_<model>_<variant>.py                # Variant pipelines (e.g. pyramid, distilled)
-    pipeline_output.py                           # Output dataclass
-  loaders/lora_pipeline.py                       # LoRA mixin (add to existing file)
-
+  models/transformers/transformer_<model>.py   # the model (or models/autoencoders/, models/unets/)
+  schedulers/scheduling_<model>.py              # only if a custom scheduler is needed
+  loaders/lora_pipeline.py                      # LoRA mixin — add to the existing file
+  pipelines/<model>/                            # standard pipeline — see pipelines.md
+  modular_pipelines/<model>/                    # modular pipeline — see modular.md
 tests/
   models/transformers/test_models_transformer_<model>.py
   pipelines/<model>/test_<model>.py
-  lora/test_lora_layers_<model>.py
-
-docs/source/en/api/
-  pipelines/<model>.md
-  models/<model>_transformer3d.md                # or appropriate name
+docs/source/en/
+  _toctree.yml                                  # register the new pages in the docs index
+  api/models/<model>.md
+  api/pipelines/<model>.md
 ```
 
-### Integration checklist
+## Model integration specific rules
 
-- [ ] Implement transformer model with `from_pretrained` support
-- [ ] Implement or reuse scheduler
-- [ ] Implement pipeline(s) with `__call__` method
-- [ ] Add LoRA support if applicable
-- [ ] Register all classes in `__init__.py` files (lazy imports)
-- [ ] Write unit tests (model, pipeline, LoRA)
-- [ ] Write docs
-- [ ] Run `make style` and `make quality`
-- [ ] Test parity with reference implementation (see `parity-testing` skill)
+**Match the reference's numerical logic.** Restructuring code to fit diffusers APIs (`ModelMixin`, `ConfigMixin`, blocks for modular, etc.) is expected, and required diffusers conventions (e.g. the attention pattern in [models.md](../../models.md)) take precedence. Beyond those, keep the actual computation as close to the reference as possible — don't reorder operations, change the math, or rename internals for aesthetics, even if it looks unclean. Small deviations make output mismatches very hard to track down.
 
-### Model conventions, attention pattern, and implementation rules
+## Weight / Checkpoint Conversion
 
-See [../../models.md](../../models.md) for the attention pattern, implementation rules, common conventions, dependencies, and gotchas. These apply to all model work.
+Convert the original checkpoint into diffusers format with a standalone script under `scripts/` (e.g. `scripts/convert_<model>_to_diffusers.py`). The flow:
 
-### Model integration specific rules
+1. Map the original state-dict keys to the diffusers module names (renames + any tensor surgery — see patterns below).
+2. Instantiate the diffusers model from its config and load the converted state dict.
+3. `save_pretrained(...)` to a **persistent** path (never `/tmp/` — see [pitfalls.md](pitfalls.md) #10), then load it back with `from_pretrained` to confirm it round-trips.
 
-**Don't combine structural changes with behavioral changes.** Restructuring code to fit diffusers APIs (ModelMixin, ConfigMixin, etc.) is unavoidable. But don't also "improve" the algorithm, refactor computation order, or rename internal variables for aesthetics. Keep numerical logic as close to the reference as possible, even if it looks unclean. For standard → modular, this is stricter: copy loop logic verbatim and only restructure into blocks. Clean up in a separate commit after parity is confirmed.
+All weights load through the standard paths — `from_pretrained`, or `from_single_file` (add `FromSingleFileMixin` + a weight-mapping) for an original-format single checkpoint. No custom `from_pretrained`, no manual runtime loading. See the loading rule in [models.md](../../models.md).
 
-### Testing
+Common conversion patterns to watch for:
+- Fused QKV weights that need splitting into separate Q, K, V
+- Scale/shift ordering differences (reference stores `[shift, scale]`, diffusers expects `[scale, shift]`)
+- Weight transpositions (linear stored as transposed conv, or vice versa)
+- Interleaved head dimensions that need reshaping
+- Bias terms absorbed into different layers
+
+## Testing
 
 Two test layers must be added for any new pipeline: pipeline-level tests, and (if a new model is introduced) model-level tests. Integration/slow tests and LoRA tests are **not** added in the initial PR — they come later, after discussion with maintainers.
 
@@ -82,7 +92,7 @@ Two test layers must be added for any new pipeline: pipeline-level tests, and (i
 - No LoRA tests in the initial PR (no `LoraTesterMixin`, no `tests/lora/test_lora_layers_<model>.py`).
 - No integration / slow tests in the initial PR — don't add anything gated on `@slow` / `RUN_SLOW=1` yet.
 
-#### Pipeline-level tests
+### Pipeline-level tests
 
 - Location: `tests/pipelines/<model>/test_<model>.py` (one file per pipeline variant, e.g. T2V, I2V).
 - Subclass both `PipelineTesterMixin` (from `..test_pipelines_common`) and `unittest.TestCase`.
@@ -91,7 +101,7 @@ Two test layers must be added for any new pipeline: pipeline-level tests, and (i
 - Skip any inherited tests that don't apply with `@unittest.skip("Test not supported")` rather than deleting them.
 - Reference: `tests/pipelines/wan/test_wan.py`.
 
-#### Model-level tests
+### Model-level tests
 
 Only required if the pipeline introduces a new model class (transformer, VAE, etc.). Don't write these by hand — generate them (example command below):
 
@@ -105,20 +115,35 @@ python utils/generate_model_tests.py src/diffusers/models/transformers/transform
 - Do **not** add `LoraTesterMixin` at the start, even if the model subclasses `PeftAdapterMixin` — strip it from the generated file for the initial PR.
 - Reference: `tests/models/transformers/test_models_transformer_flux.py`.
 
----
+## Model parity test
 
-## Modular Pipeline Conversion
+Confirm the diffusers implementation matches the reference. Test each component on **CPU/float32** with a strict tolerance (`max_diff < 1e-3`), comparing the **freshly converted** weights against the reference in a single script — both sides side by side, nothing saved to disk in between. See [pitfalls.md](pitfalls.md) for the common sources of numerical discrepancy.
 
-See [modular.md](../../modular.md) for the full guide on modular pipeline conventions, block types, build order, guider abstraction, gotchas, and conversion checklist.
+This is an **internal verification tool for integration — it should not be shipped in the PR** (it imports the reference repo). The tests that ship with the PR are the model-level and pipeline-level tests in **Testing**.
 
----
+The example below is schematic (placeholder names). `ReferenceModel` is the component **imported from the original repo**, and `convert_my_component` is **the same conversion function you wrote for the conversion script for the component**. You should make sure both load the *same* checkpoint weights and run the *same* input, so any difference is a conversion or implementation bug — not a difference in inputs.
+
+```python
+@torch.inference_mode()
+def test_my_component():
+    # deterministic input — use the same shape & dtype the real model receives at this stage
+    gen = torch.Generator().manual_seed(42)
+    x = torch.randn(1, 16, 32, 32, generator=gen, dtype=torch.float32)  # adjust to the real input shape
 
-## Weight Conversion Tips
+    original_state_dict = load_original_weights(...)  # the original checkpoint — both sides load these same weights
 
-<!-- TODO: Add concrete examples as we encounter them. Common patterns to watch for:
-  - Fused QKV weights that need splitting into separate Q, K, V
-  - Scale/shift ordering differences (reference stores [shift, scale], diffusers expects [scale, shift])
-  - Weight transpositions (linear stored as transposed conv, or vice versa)
-  - Interleaved head dimensions that need reshaping
-  - Bias terms absorbed into different layers
-  Add each with a before/after code snippet showing the conversion. -->
+    # reference: the original repo's implementation (load one model at a time to fit in CPU RAM)
+    ref_model = ReferenceModel(config)                # ReferenceModel: imported from the original repo
+    ref_model.load_state_dict(original_state_dict, strict=True)
+    ref_model = ref_model.float().eval()
+    ref_out = ref_model(x).clone()                    # clone before freeing the model
+    del ref_model
+
+    # diffusers: convert those same weights with your conversion-script function, then run
+    diff_model = convert_my_component(original_state_dict)  # convert_my_component: the fn from convert_<model>_to_diffusers.py
+    diff_model = diff_model.float().eval()
+    diff_out = diff_model(x)
+
+    max_diff = (ref_out - diff_out).abs().max().item()
+    assert max_diff < 1e-3, f"FAIL: max_diff={max_diff:.2e}"
+```
diff --git a/.ai/skills/parity-testing/pitfalls.md b/.ai/skills/model-integration/pitfalls.md
similarity index 96%
rename from .ai/skills/parity-testing/pitfalls.md
rename to .ai/skills/model-integration/pitfalls.md
index b0f59876f94a..9b1195348e75 100644
--- a/.ai/skills/parity-testing/pitfalls.md
+++ b/.ai/skills/model-integration/pitfalls.md
@@ -1,4 +1,6 @@
-# Complete Pitfalls Reference
+# Numerical Discrepancy Pitfalls
+
+Known pitfalls that cause numerical discrepancies between the original/reference implementation and the diffusers port. Check these when the diffusers outputs don't match the reference.
 
 ## 1. Global CPU RNG
 `MultivariateNormal.sample()` uses the global CPU RNG, not `torch.Generator`. Must call `torch.manual_seed(seed)` before each pipeline run. A `generator=` kwarg won't help.
@@ -99,7 +101,7 @@ The upstream model config may have wrong values for decoder-specific parameters
 
 **Detection**: Feed identical post-loop latents through both decoders. If max pixel diff is large (PSNR < 40 dB) on CPU/float32, it's a real bug, not precision noise. Trace through decoder blocks (conv_in -> mid_block -> up_blocks) to find where divergence starts.
 
-**Fix**: Correct the config value. Don't edit cached files in `~/.cache/huggingface/` -- either save to a local model directory or open a PR on the upstream repo (see Testing Rule #7).
+**Fix**: Correct the config value. Don't edit cached files in `~/.cache/huggingface/` -- either save to a local model directory or open a PR on the upstream repo.
 
 ## 16. Incomplete injection tests -- inject ALL variables or the test is invalid
 
diff --git a/.ai/skills/parity-testing/SKILL.md b/.ai/skills/parity-testing/SKILL.md
deleted file mode 100644
index b005e1a061ff..000000000000
--- a/.ai/skills/parity-testing/SKILL.md
+++ /dev/null
@@ -1,172 +0,0 @@
----
-name: testing-parity
-description: >
-  Use when debugging or verifying numerical parity between pipeline
-  implementations (e.g., research repo vs diffusers, standard vs modular).
-  Also relevant when outputs look wrong — washed out, pixelated, or have
-  visual artifacts — as these are usually parity bugs.
----
-
-> **Note**: Parity testing is **separate from** the unit-level tests that ship in `tests/`. If you are integrating a new model, the model-level test suite under `tests/models/` is still required — follow the **"#### Model-level tests"** section in [`../model-integration/SKILL.md`](../model-integration/SKILL.md) (generate via `utils/generate_model_tests.py`, no `--include` flags initially, no `LoraTesterMixin`). Parity tests verify numerical correctness during development; the generated test suite is what CI runs.
-
-## Setup — gather before starting
-
-Before writing any test code, gather:
-
-1. **Which two implementations** are being compared (e.g. research repo → diffusers, standard → modular, or research → modular). Use `AskUserQuestion` with structured choices if not already clear.
-2. **Two equivalent runnable scripts** — one for each implementation, both expected to produce identical output given the same inputs. These scripts define what "parity" means concretely.
-
-When invoked from the `model-integration` skill, you already have context: the reference script comes from step 2 of setup, and the diffusers script is the one you just wrote. You just need to make sure both scripts are runnable and use the same inputs/seed/params.
-
-## Test strategy
-
-**Component parity (CPU/float32) -- always run, as you build.**
-Test each component before assembling the pipeline. This is the foundation -- if individual pieces are wrong, the pipeline can't be right. Each component in isolation, strict max_diff < 1e-3.
-
-Test freshly converted checkpoints and saved checkpoints.
-- **Fresh**: convert from checkpoint weights, compare against reference (catches conversion bugs)
-- **Saved**: load from saved model on disk, compare against reference (catches stale saves)
-
-Keep component test scripts around -- you will need to re-run them during pipeline debugging with different inputs or config values.
-
-Template -- one self-contained script per component, reference and diffusers side-by-side:
-```python
-@torch.inference_mode()
-def test_my_component(mode="fresh", model_path=None):
-    # 1. Deterministic input
-    gen = torch.Generator().manual_seed(42)
-    x = torch.randn(1, 3, 64, 64, generator=gen, dtype=torch.float32)
-
-    # 2. Reference: load from checkpoint, run, free
-    ref_model = ReferenceModel.from_config(config)
-    ref_model.load_state_dict(load_weights("prefix"), strict=True)
-    ref_model = ref_model.float().eval()
-    ref_out = ref_model(x).clone()
-    del ref_model
-
-    # 3. Diffusers: fresh (convert weights) or saved (from_pretrained)
-    if mode == "fresh":
-        diff_model = convert_my_component(load_weights("prefix"))
-    else:
-        diff_model = DiffusersModel.from_pretrained(model_path, torch_dtype=torch.float32)
-    diff_model = diff_model.float().eval()
-    diff_out = diff_model(x)
-    del diff_model
-
-    # 4. Compare in same script -- no saving to disk
-    max_diff = (ref_out - diff_out).abs().max().item()
-    assert max_diff < 1e-3, f"FAIL: max_diff={max_diff:.2e}"
-```
-Key points: (a) both reference and diffusers component in one script -- never split into separate scripts that save/load intermediates, (b) deterministic input via seeded generator, (c) load one model at a time to fit in CPU RAM, (d) `.clone()` the reference output before deleting the model.
-
-**E2E visual (GPU/bfloat16) -- once the pipeline is assembled.**
-Both pipelines generate independently with identical seeds/params. Save outputs and compare visually. If outputs look identical, you're done -- no need for deeper testing.
-
-**Pipeline stage tests -- only if E2E fails and you need to isolate the bug.**
-If the user already suspects where divergence is, start there. Otherwise, work through stages in order.
-
-First, **match noise generation**: the way initial noise/latents are constructed (seed handling, generator, randn call order) often differs between the two scripts. If the noise doesn't match, nothing downstream will match. Check how noise is initialized in the diffusers script — if it doesn't match the reference, temporarily change it to match. Note what you changed so it can be reverted after parity is confirmed.
-
-For small models, run on CPU/float32 for strict comparison. For large models (e.g. 22B params), CPU/float32 is impractical -- use GPU/bfloat16 with `enable_model_cpu_offload()` and relax tolerances (max_diff < 1e-1 for bfloat16 is typical for passing tests; cosine similarity > 0.9999 is a good secondary check).
-
-Test encode and decode stages first -- they're simpler and bugs there are easier to fix. Only debug the denoising loop if encode and decode both pass.
-
-The challenge: pipelines are monolithic `__call__` methods -- you can't just call "the encode part". See [checkpoint-mechanism.md](checkpoint-mechanism.md) for the checkpoint class that lets you stop, save, or inject tensors at named locations inside the pipeline.
-
-**Stage test order — encode, decode, then denoise:**
-
-- **`encode`** (test first): Stop both pipelines at `"preloop"`. Compare **every single variable** that will be consumed by the denoising loop -- not just latents and sigmas, but also prompt embeddings, attention masks, positional coordinates, connector outputs, and any conditioning inputs.
-- **`decode`** (test second, before denoise): Run the reference pipeline fully -- checkpoint the post-loop latents AND let it finish to get the decoded output. Then feed those same post-loop latents through the diffusers pipeline's decode path. Compare both numerically AND visually.
-- **`denoise`** (test last): Run both pipelines with realistic `num_steps` (e.g. 30) so the scheduler computes correct sigmas/timesteps, but stop after 2 loop iterations using `after_step_1`. Don't set `num_steps=2` -- that produces unrealistic sigma schedules.
-
-```python
-# Encode stage -- stop before the loop, compare ALL inputs:
-ref_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
-run_reference_pipeline(ref_ckpts)
-ref_data = ref_ckpts["preloop"].data
-
-diff_ckpts = {"preloop": Checkpoint(save=True, stop=True)}
-run_diffusers_pipeline(diff_ckpts)
-diff_data = diff_ckpts["preloop"].data
-
-# Compare EVERY variable consumed by the denoise loop:
-compare_tensors("latents", ref_data["latents"], diff_data["latents"])
-compare_tensors("sigmas", ref_data["sigmas"], diff_data["sigmas"])
-compare_tensors("prompt_embeds", ref_data["prompt_embeds"], diff_data["prompt_embeds"])
-# ... every single tensor the transformer forward() will receive
-```
-
-**E2E-injected visual test**: Once you've identified a suspected root cause using stage tests, confirm it with an e2e-injected run -- inject the known-good tensor from reference and generate a full video. If the output looks identical to reference, you've confirmed the root cause.
-
-## Debugging technique: Injection for root-cause isolation
-
-When stage tests show divergence, **inject a known-good tensor from one pipeline into the other** to test whether the remaining code is correct.
-
-The principle: if you suspect input X is the root cause of divergence in stage S:
-1. Run the reference pipeline and capture X
-2. Run the diffusers pipeline but **replace** its X with the reference's X (via checkpoint load)
-3. Compare outputs of stage S
-
-If outputs now match: X was the root cause. If they still diverge: the bug is in the stage logic itself, not in X.
-
-| What you're testing | What you inject | Where you inject |
-|---|---|---|
-| Is the decode stage correct? | Post-loop latents from reference | Before decode |
-| Is the denoise loop correct? | Pre-loop latents from reference | Before the loop |
-| Is step N correct? | Post-step-(N-1) latents from reference | Before step N |
-
-**Per-step accumulation tracing**: When injection confirms the loop is correct but you want to understand *how* a small initial difference compounds, capture `after_step_{i}` for every step and plot the max_diff curve. A healthy curve stays bounded; an exponential blowup in later steps points to an amplification mechanism (see Pitfall #13 in [pitfalls.md](pitfalls.md)).
-
-## Debugging technique: Visual comparison via frame extraction
-
-For video pipelines, numerical metrics alone can be misleading. Extract and view individual frames:
-
-```python
-import numpy as np
-from PIL import Image
-
-def extract_frames(video_np, frame_indices):
-    """video_np: (frames, H, W, 3) float array in [0, 1]"""
-    for idx in frame_indices:
-        frame = (video_np[idx] * 255).clip(0, 255).astype(np.uint8)
-        img = Image.fromarray(frame)
-        img.save(f"frame_{idx}.png")
-
-# Compare specific frames from both pipelines
-extract_frames(ref_video, [0, 60, 120])
-extract_frames(diff_video, [0, 60, 120])
-```
-
-## Testing rules
-
-1. **Never use reference code in the diffusers test path.** Each side must use only its own code.
-2. **Never monkey-patch model internals in tests.** Do not replace `model.forward` or patch internal methods.
-3. **Debugging instrumentation must be non-destructive.** Checkpoint captures for debugging are fine, but must not alter control flow or outputs.
-4. **Prefer CPU/float32 for numerical comparison when practical.** Float32 avoids bfloat16 precision noise that obscures real bugs. But for large models (22B+), GPU/bfloat16 with `enable_model_cpu_offload()` is necessary -- use relaxed tolerances and cosine similarity as a secondary metric.
-5. **Test both fresh conversion AND saved model.** Fresh catches conversion logic bugs; saved catches stale/corrupted weights from previous runs.
-6. **Diff configs before debugging.** Before investigating any divergence, dump and compare all config values. A 30-second config diff prevents hours of debugging based on wrong assumptions.
-7. **Never modify cached/downloaded model configs directly.** Don't edit files in `~/.cache/huggingface/`. Instead, save to a local directory or open a PR on the upstream repo.
-8. **Compare ALL loop inputs in the encode test.** The preloop checkpoint must capture every single tensor the transformer forward() will receive.
-
-## Comparison utilities
-
-```python
-def compare_tensors(name: str, a: torch.Tensor, b: torch.Tensor, tol: float = 1e-3) -> bool:
-    if a.shape != b.shape:
-        print(f"  FAIL {name}: shape mismatch {a.shape} vs {b.shape}")
-        return False
-    diff = (a.float() - b.float()).abs()
-    max_diff = diff.max().item()
-    mean_diff = diff.mean().item()
-    cos = torch.nn.functional.cosine_similarity(
-        a.float().flatten().unsqueeze(0), b.float().flatten().unsqueeze(0)
-    ).item()
-    passed = max_diff < tol
-    print(f"  {'PASS' if passed else 'FAIL'} {name}: max={max_diff:.2e}, mean={mean_diff:.2e}, cos={cos:.5f}")
-    return passed
-```
-Cosine similarity is especially useful for GPU/bfloat16 tests where max_diff can be noisy -- `cos > 0.9999` is a strong signal even when max_diff exceeds tolerance.
-
-## Gotchas
-
-See [pitfalls.md](pitfalls.md) for the full list of gotchas to watch for during parity testing.
diff --git a/.ai/skills/parity-testing/checkpoint-mechanism.md b/.ai/skills/parity-testing/checkpoint-mechanism.md
deleted file mode 100644
index 43743ebb07a5..000000000000
--- a/.ai/skills/parity-testing/checkpoint-mechanism.md
+++ /dev/null
@@ -1,103 +0,0 @@
-# Checkpoint Mechanism for Stage Testing
-
-## Overview
-
-Pipelines are monolithic `__call__` methods -- you can't just call "the encode part". The checkpoint mechanism lets you stop, save, or inject tensors at named locations inside the pipeline.
-
-## The Checkpoint class
-
-Add a `_checkpoints` argument to both the diffusers pipeline and the reference implementation.
-
-```python
-@dataclass
-class Checkpoint:
-    save: bool = False   # capture variables into ckpt.data
-    stop: bool = False   # halt pipeline after this point
-    load: bool = False   # inject ckpt.data into local variables
-    data: dict = field(default_factory=dict)
-```
-
-## Pipeline instrumentation
-
-The pipeline accepts an optional `dict[str, Checkpoint]`. Place checkpoint calls at boundaries between pipeline stages -- after each encoder, before the denoising loop (capture all loop inputs), after each loop iteration, after the loop (capture final latents before decode).
-
-```python
-def __call__(self, prompt, ..., _checkpoints=None):
-    # --- text encoding ---
-    prompt_embeds = self.text_encoder(prompt)
-    _maybe_checkpoint(_checkpoints, "text_encoding", {
-        "prompt_embeds": prompt_embeds,
-    })
-
-    # --- prepare latents, sigmas, positions ---
-    latents = self.prepare_latents(...)
-    sigmas = self.scheduler.sigmas
-    # ...
-
-    _maybe_checkpoint(_checkpoints, "preloop", {
-        "latents": latents,
-        "sigmas": sigmas,
-        "prompt_embeds": prompt_embeds,
-        "prompt_attention_mask": prompt_attention_mask,
-        "video_coords": video_coords,
-        # capture EVERYTHING the loop needs -- every tensor the transformer
-        # forward() receives. Missing even one variable here means you can't
-        # tell if it's the source of divergence during denoise debugging.
-    })
-
-    # --- denoising loop ---
-    for i, t in enumerate(timesteps):
-        noise_pred = self.transformer(latents, t, prompt_embeds, ...)
-        latents = self.scheduler.step(noise_pred, t, latents)[0]
-
-        _maybe_checkpoint(_checkpoints, f"after_step_{i}", {
-            "latents": latents,
-        })
-
-    _maybe_checkpoint(_checkpoints, "post_loop", {
-        "latents": latents,
-    })
-
-    # --- decode ---
-    video = self.vae.decode(latents)
-    return video
-```
-
-## The helper function
-
-Each `_maybe_checkpoint` call does three things based on the Checkpoint's flags: `save` captures the local variables into `ckpt.data`, `load` injects pre-populated `ckpt.data` back into local variables, `stop` halts execution (raises an exception caught at the top level).
-
-```python
-def _maybe_checkpoint(checkpoints, name, data):
-    if not checkpoints:
-        return
-    ckpt = checkpoints.get(name)
-    if ckpt is None:
-        return
-    if ckpt.save:
-        ckpt.data.update(data)
-    if ckpt.stop:
-        raise PipelineStop  # caught at __call__ level, returns None
-```
-
-## Injection support
-
-Add `load` support at each checkpoint where you might want to inject:
-
-```python
-_maybe_checkpoint(_checkpoints, "preloop", {"latents": latents, ...})
-
-# Load support: replace local variables with injected data
-if _checkpoints:
-    ckpt = _checkpoints.get("preloop")
-    if ckpt is not None and ckpt.load:
-        latents = ckpt.data["latents"].to(device=device, dtype=latents.dtype)
-```
-
-## Key insight
-
-The checkpoint dict is passed into the pipeline and mutated in-place. After the pipeline returns (or stops early), you read back `ckpt.data` to get the captured tensors. Both pipelines save under their own key names, so the test maps between them (e.g. reference `"video_state.latent"` -> diffusers `"latents"`).
-
-## Memory management for large models
-
-For large models, free the source pipeline's GPU memory before loading the target pipeline. Clone injected tensors to CPU, delete everything else, then run the target with `enable_model_cpu_offload()`.
diff --git a/.ai/skills/self-review/SKILL.md b/.ai/skills/self-review/SKILL.md
new file mode 100644
index 000000000000..f9bc4545839e
--- /dev/null
+++ b/.ai/skills/self-review/SKILL.md
@@ -0,0 +1,48 @@
+---
+name: self-review
+description: >
+  Use before opening a PR, or whenever asked to self-review a diffusers
+  contribution. Runs the same review the `@claude` CI runs: checks the diff
+  against .ai/review-rules.md, traces call paths for dead code, and reports
+  findings grouped by severity — flagging what to fix before submitting (blocking
+  issues + dead code) vs what to leave for the actual review. Report-only — does
+  not edit files.
+---
+
+# Self-review
+
+The same pass the `@claude` CI reviewer runs, so you can catch issues before a
+maintainer does. You're already on the branch with the conventions loaded, so:
+get the diff → review it against the rubric → report. **Review only changes under
+`src/diffusers/` and `.ai/`** (skip everything else, like the CI does).
+
+## 1. Get the diff
+
+```bash
+git diff main...HEAD          # use your target branch if not main
+```
+
+If the branch trails `main` and the diff looks polluted with unrelated merged
+files, scope to your own commits: `git log main..HEAD --oneline`, then
+`git show <commit>`.
+
+## 2. Read the rubric
+
+`.ai/review-rules.md` is the canonical rubric (the CI pins it from `main`) — read
+it and review against it; don't rely on a remembered copy. For the areas you
+touched, also read `.ai/models.md`, `.ai/pipelines.md`, or `.ai/modular.md`.
+
+## 3. Report
+
+- **Blocking issues** — numbered. Each: title → explanation → `file.py:line` →
+  impact. Cite the rule, e.g. *Per `.ai/models.md`: "…only keep the inference path."*
+- **Non-blocking issues** — same format, lower severity.
+- **Dead code (advisory)** — a table: `path:line` · Likely-dead / Used · reason.
+- **Summary** — short synthesis and a verdict (**READY** / **NEEDS CHANGES**),
+  spelling out:
+  - **Fix before submitting** — all blocking issues, and remove the flagged dead code.
+  - **Leave for the actual review** — non-blocking issues that aren't obviously
+    correct; raise these with the reviewer rather than guessing at them now.
+
+Report only — do not edit files. Be concrete, cite the rule, review only the diff
+under `src/diffusers/` and `.ai/`, and don't invent issues or flag pure style.

From 72be65e5a80561724558c3b3952acf387f13663a Mon Sep 17 00:00:00 2001
From: yiyixuxu <yixu310@gmail.com>
Date: Thu, 11 Jun 2026 00:47:35 +0000
Subject: [PATCH 2/9] [.ai] simplify pitfalls #6 and drop the model-storage /
 injection-test entries

Trim pitfall #6 to the essential point (small dtype diffs compound into a large
final difference), remove the `/tmp` model-storage and incomplete-injection-test
pitfalls, and renumber 1-16 with cross-references updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .ai/skills/model-integration/pitfalls.md | 29 ++++++++----------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/.ai/skills/model-integration/pitfalls.md b/.ai/skills/model-integration/pitfalls.md
index 9b1195348e75..356fb1f1be99 100644
--- a/.ai/skills/model-integration/pitfalls.md
+++ b/.ai/skills/model-integration/pitfalls.md
@@ -18,7 +18,7 @@ If noise generation depends on `patch_size` (e.g. `sample_block_noise`), it must
 Nested loops (stages -> chunks -> timesteps) can shadow variable names. If outer loop uses `latents` and inner loop also assigns to `latents`, scoping must match the reference.
 
 ## 6. Float precision differences -- don't dismiss them
-Target may compute in float32 where reference used bfloat16. Small per-element diffs (1e-3 to 1e-2) *look* harmless but can compound catastrophically over iterative processes like denoising loops (see Pitfalls #11 and #13). Before dismissing a precision difference: (a) check whether it feeds into an iterative process, (b) if so, trace the accumulation curve over all iterations to see if it stays bounded or grows exponentially. Only truly non-iterative precision diffs (e.g. in a single-pass encoder) are safe to accept.
+Small per-element diffs from a dtype mismatch (e.g. float32 vs bfloat16, ~1e-3 to 1e-2) look harmless, but in an iterative process like the denoising loop they can compound into a large final difference (see #10 and #12). Check whether a precision diff feeds an iterative process before accepting it.
 
 ## 7. Scheduler state reset between stages
 Some schedulers accumulate state (e.g. `model_outputs` in UniPC) that must be cleared between stages.
@@ -29,10 +29,7 @@ Standard: `self.transformer`. Modular: `components.transformer`. Missing this ca
 ## 9. Guider state across stages
 In multi-stage denoising, the guider's internal state (e.g. `zero_init_steps`) may need save/restore between stages.
 
-## 10. Model storage location
-NEVER store converted models in `/tmp/` -- temporary directories get wiped on restart. Always save converted checkpoints under a persistent path in the project repo (e.g. `models/ltx23-diffusers/`).
-
-## 11. Noise dtype mismatch (causes washed-out output)
+## 10. Noise dtype mismatch (causes washed-out output)
 
 Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:
 
@@ -41,7 +38,7 @@ noise = torch.randn(..., dtype=torch.float32, generator=gen)
 noise = noise.to(dtype=model_dtype)  # bfloat16 -- values get quantized
 ```
 
-Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (Pitfall #13) and produces completely washed-out output.
+Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (#12) and produces completely washed-out output.
 
 **Fix**: Match the reference -- generate noise in the model's working dtype:
 ```python
@@ -51,15 +48,15 @@ latents = self.prepare_latents(..., dtype=latent_dtype, ...)
 
 **Detection**: Encode stage test shows initial latent max_diff of exactly ~1.5e-02. This specific magnitude is the signature of float32->bfloat16 quantization error.
 
-## 12. RoPE position dtype
+## 11. RoPE position dtype
 
 RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly (max_diff up to 2.0). Different modalities may use different position dtypes (e.g. video bfloat16, audio float32) -- check the reference carefully.
 
-## 13. 1/sigma error amplification in Euler denoising
+## 12. 1/sigma error amplification in Euler denoising
 
-In Euler/flow-matching, the velocity formula divides by sigma: `v = (latents - pred_x0) / sigma`. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (Pitfalls #11, #12) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.
+In Euler/flow-matching, the velocity formula divides by sigma: `v = (latents - pred_x0) / sigma`. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (#10, #11) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.
 
-## 14. Config value assumptions -- always diff, never assume
+## 13. Config value assumptions -- always diff, never assume
 
 When debugging parity, don't assume config values match code defaults. The published model checkpoint may override defaults with different values. A wrong assumption about a single config field can send you down hours of debugging in the wrong direction.
 
@@ -95,7 +92,7 @@ Run this **before** writing any hooks, analysis code, or fixes. It takes 30 seco
 **When debugging divergence -- trace values, don't reason about them:**
 If two implementations diverge, hook the actual intermediate values at the point of divergence rather than reading code to figure out what the values "should" be. Code analysis builds on assumptions; value tracing reveals facts.
 
-## 15. Decoder config mismatch (causes pixelated artifacts)
+## 14. Decoder config mismatch (causes pixelated artifacts)
 
 The upstream model config may have wrong values for decoder-specific parameters (e.g. `upsample_residual`, `upsample_type`). These control whether the decoder uses skip connections in upsampling -- getting them wrong produces severe pixelation or blocky artifacts.
 
@@ -103,16 +100,10 @@ The upstream model config may have wrong values for decoder-specific parameters
 
 **Fix**: Correct the config value. Don't edit cached files in `~/.cache/huggingface/` -- either save to a local model directory or open a PR on the upstream repo.
 
-## 16. Incomplete injection tests -- inject ALL variables or the test is invalid
-
-When doing injection tests (feeding reference tensors into the diffusers pipeline), you must inject **every** divergent input, including sigmas/timesteps. A common mistake: the preloop checkpoint saves sigmas but the injection code only loads latents and embeddings. The test then runs with different sigma schedules, making it impossible to isolate the real cause.
-
-**Prevention**: After writing injection code, verify by listing every variable the injected stage consumes and checking each one is either (a) injected from reference, or (b) confirmed identical between pipelines.
-
-## 17. bf16 connector/encoder divergence -- don't chase it
+## 15. bf16 connector/encoder divergence -- don't chase it
 
 When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.
 
-## 18. Stale test fixtures
+## 16. Stale test fixtures
 
 When using saved tensors for cross-pipeline comparison, always ensure both sets of tensors were captured from the same run configuration (same seed, same config, same code version). Mixing fixtures from different runs (e.g. reference tensors from yesterday, diffusers tensors from today after a code change) creates phantom divergence that wastes debugging time. Regenerate both sides in a single test script execution.

From 78bd9b7d800028a1781d771b42d18621ca627996 Mon Sep 17 00:00:00 2001
From: yiyixuxu <yixu310@gmail.com>
Date: Thu, 11 Jun 2026 00:56:03 +0000
Subject: [PATCH 3/9] [.ai] drop parity-harness-specific pitfalls

With the parity-testing skill gone, remove the stale-test-fixtures pitfall (saved
tensors / cross-pipeline fixtures no longer apply) and de-jargon the noise-dtype
detection note. Keeps the pitfalls list generic to numerical discrepancy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .ai/skills/model-integration/pitfalls.md | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/.ai/skills/model-integration/pitfalls.md b/.ai/skills/model-integration/pitfalls.md
index 356fb1f1be99..003e8fc1ddc1 100644
--- a/.ai/skills/model-integration/pitfalls.md
+++ b/.ai/skills/model-integration/pitfalls.md
@@ -46,7 +46,7 @@ latent_dtype = self.transformer.dtype  # e.g. bfloat16
 latents = self.prepare_latents(..., dtype=latent_dtype, ...)
 ```
 
-**Detection**: Encode stage test shows initial latent max_diff of exactly ~1.5e-02. This specific magnitude is the signature of float32->bfloat16 quantization error.
+**Detection**: The encoded latents show an initial max_diff of exactly ~1.5e-02 — this specific magnitude is the signature of float32->bfloat16 quantization error.
 
 ## 11. RoPE position dtype
 
@@ -103,7 +103,3 @@ The upstream model config may have wrong values for decoder-specific parameters
 ## 15. bf16 connector/encoder divergence -- don't chase it
 
 When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.
-
-## 16. Stale test fixtures
-
-When using saved tensors for cross-pipeline comparison, always ensure both sets of tensors were captured from the same run configuration (same seed, same config, same code version). Mixing fixtures from different runs (e.g. reference tensors from yesterday, diffusers tensors from today after a code change) creates phantom divergence that wastes debugging time. Regenerate both sides in a single test script execution.

From 70b687e2159bc0e3733fe1f4b5c491907bc84275 Mon Sep 17 00:00:00 2001
From: yiyixuxu <yixu310@gmail.com>
Date: Thu, 11 Jun 2026 01:21:05 +0000
Subject: [PATCH 4/9] [.ai] trim pitfalls to a concise possible-causes
 reference

Drop the variable-shadowing and decoder-config pitfalls and the noise-dtype
'Detection' aside, tighten the remaining entries, renumber 1-12 (cross-refs
updated), and reframe the intro as a non-checklist reference list of possible
causes to consult only when outputs don't match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .ai/skills/model-integration/pitfalls.md | 79 +++++-------------------
 1 file changed, 15 insertions(+), 64 deletions(-)

diff --git a/.ai/skills/model-integration/pitfalls.md b/.ai/skills/model-integration/pitfalls.md
index 003e8fc1ddc1..d32e6f41e187 100644
--- a/.ai/skills/model-integration/pitfalls.md
+++ b/.ai/skills/model-integration/pitfalls.md
@@ -1,6 +1,6 @@
 # Numerical Discrepancy Pitfalls
 
-Known pitfalls that cause numerical discrepancies between the original/reference implementation and the diffusers port. Check these when the diffusers outputs don't match the reference.
+A reference list of things that have caused numerical discrepancies between an original/reference implementation and the diffusers port. It's not a checklist — most won't apply to any given model; consult it only when the diffusers outputs don't match the reference.
 
 ## 1. Global CPU RNG
 `MultivariateNormal.sample()` uses the global CPU RNG, not `torch.Generator`. Must call `torch.manual_seed(seed)` before each pipeline run. A `generator=` kwarg won't help.
@@ -12,24 +12,21 @@ Many transformers expect `int64` timesteps. `get_timestep_embedding` casts to fl
 Parameter names may differ: reference `zero_steps=1` (meaning `i <= 1`, 2 steps) vs target `zero_init_steps=2` (meaning `step < 2`, same thing). Check exact semantics.
 
 ## 4. `patch_size` in noise generation
-If noise generation depends on `patch_size` (e.g. `sample_block_noise`), it must be passed through. Missing it changes noise spatial structure.
+If noise generation depends on `patch_size`, it must be passed through. Missing it changes noise spatial structure.
 
-## 5. Variable shadowing in nested loops
-Nested loops (stages -> chunks -> timesteps) can shadow variable names. If outer loop uses `latents` and inner loop also assigns to `latents`, scoping must match the reference.
+## 5. Float precision differences -- don't dismiss them
+Small per-element diffs from a dtype mismatch (e.g. float32 vs bfloat16, ~1e-3 to 1e-2) look harmless, but in an iterative process like the denoising loop they can compound into a large final difference (see #9 and #11). Check whether a precision diff feeds an iterative process before accepting it.
 
-## 6. Float precision differences -- don't dismiss them
-Small per-element diffs from a dtype mismatch (e.g. float32 vs bfloat16, ~1e-3 to 1e-2) look harmless, but in an iterative process like the denoising loop they can compound into a large final difference (see #10 and #12). Check whether a precision diff feeds an iterative process before accepting it.
-
-## 7. Scheduler state reset between stages
+## 6. Scheduler state reset between stages
 Some schedulers accumulate state (e.g. `model_outputs` in UniPC) that must be cleared between stages.
 
-## 8. Component access
+## 7. Component access
 Standard: `self.transformer`. Modular: `components.transformer`. Missing this causes AttributeError.
 
-## 9. Guider state across stages
+## 8. Guider state across stages
 In multi-stage denoising, the guider's internal state (e.g. `zero_init_steps`) may need save/restore between stages.
 
-## 10. Noise dtype mismatch (causes washed-out output)
+## 9. Noise dtype mismatch (causes washed-out output)
 
 Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:
 
@@ -38,7 +35,7 @@ noise = torch.randn(..., dtype=torch.float32, generator=gen)
 noise = noise.to(dtype=model_dtype)  # bfloat16 -- values get quantized
 ```
 
-Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (#12) and produces completely washed-out output.
+Diffusers pipelines may keep latents in float32 throughout the loop. The per-element difference is only ~1.5e-02, but this compounds over 30 denoising steps via 1/sigma amplification (#11) and produces completely washed-out output.
 
 **Fix**: Match the reference -- generate noise in the model's working dtype:
 ```python
@@ -46,60 +43,14 @@ latent_dtype = self.transformer.dtype  # e.g. bfloat16
 latents = self.prepare_latents(..., dtype=latent_dtype, ...)
 ```
 
-**Detection**: The encoded latents show an initial max_diff of exactly ~1.5e-02 — this specific magnitude is the signature of float32->bfloat16 quantization error.
-
-## 11. RoPE position dtype
-
-RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly (max_diff up to 2.0). Different modalities may use different position dtypes (e.g. video bfloat16, audio float32) -- check the reference carefully.
-
-## 12. 1/sigma error amplification in Euler denoising
-
-In Euler/flow-matching, the velocity formula divides by sigma: `v = (latents - pred_x0) / sigma`. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps, reaching max_diff ~6.0. This is why dtype mismatches (#10, #11) that seem tiny at init produce visually broken output. Use per-step accumulation tracing to diagnose.
-
-## 13. Config value assumptions -- always diff, never assume
-
-When debugging parity, don't assume config values match code defaults. The published model checkpoint may override defaults with different values. A wrong assumption about a single config field can send you down hours of debugging in the wrong direction.
-
-**The pattern that goes wrong:**
-1. You see `param_x` has default `1` in the code
-2. The reference code also uses `param_x` with a default of `1`
-3. You assume both sides use `1` and apply a "fix" based on that
-4. But the actual checkpoint config has `param_x: 1000`, and so does the published diffusers config
-5. Your "fix" now *creates* divergence instead of fixing it
-
-**Prevention -- config diff first:**
-```python
-# Reference: read from checkpoint metadata (no model loading needed)
-from safetensors import safe_open
-import json
-ref_config = json.loads(safe_open(checkpoint_path, framework="pt").metadata()["config"])
-
-# Diffusers: read from model config
-from diffusers import MyModel
-diff_model = MyModel.from_pretrained(model_path, subfolder="transformer")
-diff_config = dict(diff_model.config)
-
-# Compare all values
-for key in sorted(set(list(ref_config.get("transformer", {}).keys()) + list(diff_config.keys()))):
-    ref_val = ref_config.get("transformer", {}).get(key, "MISSING")
-    diff_val = diff_config.get(key, "MISSING")
-    if ref_val != diff_val:
-        print(f"  DIFF {key}: ref={ref_val}, diff={diff_val}")
-```
-
-Run this **before** writing any hooks, analysis code, or fixes. It takes 30 seconds and catches wrong assumptions immediately.
-
-**When debugging divergence -- trace values, don't reason about them:**
-If two implementations diverge, hook the actual intermediate values at the point of divergence rather than reading code to figure out what the values "should" be. Code analysis builds on assumptions; value tracing reveals facts.
-
-## 14. Decoder config mismatch (causes pixelated artifacts)
+## 10. RoPE position dtype
 
-The upstream model config may have wrong values for decoder-specific parameters (e.g. `upsample_residual`, `upsample_type`). These control whether the decoder uses skip connections in upsampling -- getting them wrong produces severe pixelation or blocky artifacts.
+RoPE cosine/sine values are sensitive to position coordinate dtype. If reference uses bfloat16 positions but diffusers uses float32, the RoPE output diverges significantly.
 
-**Detection**: Feed identical post-loop latents through both decoders. If max pixel diff is large (PSNR < 40 dB) on CPU/float32, it's a real bug, not precision noise. Trace through decoder blocks (conv_in -> mid_block -> up_blocks) to find where divergence starts.
+## 11. 1/sigma error amplification in Euler denoising
 
-**Fix**: Correct the config value. Don't edit cached files in `~/.cache/huggingface/` -- either save to a local model directory or open a PR on the upstream repo.
+In Euler/flow-matching, the velocity formula divides by sigma: `v = (latents - pred_x0) / sigma`. As sigma shrinks from ~1.0 (step 0) to ~0.001 (step 29), errors are amplified up to 1000x. A 1.5e-02 init difference grows linearly through mid-steps, then exponentially in final steps. This is why dtype mismatches (#9, #10) that seem tiny at init produce visually broken output.
 
-## 15. bf16 connector/encoder divergence -- don't chase it
+## 12. Config value assumptions
 
-When running on GPU/bfloat16, multi-layer encoders (e.g. 8-layer connector transformers) accumulate bf16 rounding noise that looks alarming (max_diff 0.3-2.7). Before investigating, re-run the component test on CPU/float32. If it passes (max_diff < 1e-4), the divergence is pure precision noise, not a code bug. Don't spend hours tracing through layers -- confirm on CPU/float32 and move on.
+Don't assume config values match the code defaults: the published checkpoint may override them (and so may the diffusers config). Look up the actual config.

From c2b8bd29e6fbbf5a162290f8dd321933e0a9ce99 Mon Sep 17 00:00:00 2001
From: YiYi Xu <yixu310@gmail.com>
Date: Wed, 10 Jun 2026 15:23:13 -1000
Subject: [PATCH 5/9] Apply suggestion from @yiyixuxu

---
 .ai/skills/model-integration/pitfalls.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.ai/skills/model-integration/pitfalls.md b/.ai/skills/model-integration/pitfalls.md
index d32e6f41e187..d64c67175e4e 100644
--- a/.ai/skills/model-integration/pitfalls.md
+++ b/.ai/skills/model-integration/pitfalls.md
@@ -26,7 +26,7 @@ Standard: `self.transformer`. Modular: `components.transformer`. Missing this ca
 ## 8. Guider state across stages
 In multi-stage denoising, the guider's internal state (e.g. `zero_init_steps`) may need save/restore between stages.
 
-## 9. Noise dtype mismatch (causes washed-out output)
+## 9. Noise dtype mismatch
 
 Reference code often generates noise in float32 then casts to model dtype (bfloat16) before storing:
 

From ae6513b5266c57ce50fb3a1e640863809608f276 Mon Sep 17 00:00:00 2001
From: YiYi Xu <yixu310@gmail.com>
Date: Wed, 10 Jun 2026 15:25:17 -1000
Subject: [PATCH 6/9] Apply suggestion from @yiyixuxu

---
 .ai/skills/model-integration/SKILL.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.ai/skills/model-integration/SKILL.md b/.ai/skills/model-integration/SKILL.md
index 32f0a30fa0c2..cb3dbce4ea44 100644
--- a/.ai/skills/model-integration/SKILL.md
+++ b/.ai/skills/model-integration/SKILL.md
@@ -22,7 +22,7 @@ Ask step 3 as an `AskUserQuestion`, with modular marked as the recommended defau
 
 Once you have everything, **confirm the plan** with the user before implementing — state exactly what you'll do, e.g. "I'll integrate model X with pipeline Y based on your script, and verify the model matches the reference before considering it done."
 
-Then work through the **Integration checklist** below — it covers both standard and modular; only the pipeline step differs.
+Then work through the **Integration checklist** below
 
 ## Integration checklist
 

From 81392e0756c8e4d216d17fc45065fcc9b3eb985e Mon Sep 17 00:00:00 2001
From: yiyixuxu <yixu310@gmail.com>
Date: Thu, 11 Jun 2026 04:04:37 +0000
Subject: [PATCH 7/9] [docs] update contributing guide for the self-review
 skill

Replace the retired parity-testing skill with self-review in the skills list, and
add a 'Self-review before opening' step to the AI-assisted contributions section:
run the self-review skill / review-rules, fix blocking issues + dead code, and
treat the @claude CI review as a non-authoritative helper (note any intentional
skips in the PR for the reviewer).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/source/en/conceptual/contribution.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/source/en/conceptual/contribution.md b/docs/source/en/conceptual/contribution.md
index 299adddcaac3..a228e18fd31a 100644
--- a/docs/source/en/conceptual/contribution.md
+++ b/docs/source/en/conceptual/contribution.md
@@ -581,7 +581,7 @@ The repository keeps AI-agent configuration in [`.ai/`](https://github.com/huggi
   - [`.ai/review-rules.md`](https://github.com/huggingface/diffusers/blob/main/.ai/review-rules.md) — what reviewers look for
 - **Skills** (under [`.ai/skills/`](https://github.com/huggingface/diffusers/tree/main/.ai/skills), loaded on demand for specific tasks):
   - `model-integration` — adding a new model or pipeline to diffusers end-to-end (file structure, integration checklist, testing layout, weight conversion)
-  - `parity-testing` — verifying numerical parity between the diffusers implementation and a reference implementation
+  - `self-review` — review your changes against the project rules before opening a PR
 - **Setup commands**:
   - `make codex` — symlink guidelines + skills for OpenAI Codex
   - `make claude` — symlink guidelines + skills for Claude Code
@@ -593,6 +593,7 @@ AI-assisted contributions are welcome, but they must be coordinated, scoped, and
 
 - **Coordinate before opening a PR.** Find or open an issue, review similar PRs (open and recently closed), and wait for an explicit acknowledgment from a maintainer on that issue before opening a PR. This gives us a chance to discuss scope, avoid duplicate work, and confirm the approach.
 - **Fix patterns, not one-offs.** If you spot an recurring issue, search the codebase for similar instances and open a *single* issue with a clear, systematic scope (e.g. "fix mutable defaults across all schedulers") rather than many issues or PRs for individual instances. 
+- **Self-review before opening.** Run the [`self-review`](https://github.com/huggingface/diffusers/blob/main/.ai/skills/self-review/SKILL.md) skill (or review your diff against [`.ai/review-rules.md`](https://github.com/huggingface/diffusers/blob/main/.ai/review-rules.md)) and address what it reports — it's a helper, not authoritative, and can be wrong. Focus on the blocking issues that make sense to you, and clean up dead/unused code as much as possible. If you disagree with a suggestion, it's fine to leave it for the reviewer to discuss after the PR is opened — just add a brief note in the PR description for anything you intentionally skipped, so the reviewer knows it was a deliberate call.
 - **Include in the PR description:**
   - A **coordination link** to the issue or discussion where a maintainer acknowledged the work.
   - The **test commands you ran** and their results (paste relevant output, not just "tests pass").

From 1f93b63100f9b4b06ca09447c5cd01d0a7821523 Mon Sep 17 00:00:00 2001
From: YiYi Xu <yixu310@gmail.com>
Date: Thu, 11 Jun 2026 07:31:30 -1000
Subject: [PATCH 8/9] Apply suggestions from code review

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
---
 .ai/skills/model-integration/SKILL.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/.ai/skills/model-integration/SKILL.md b/.ai/skills/model-integration/SKILL.md
index cb3dbce4ea44..8bc1a52fd696 100644
--- a/.ai/skills/model-integration/SKILL.md
+++ b/.ai/skills/model-integration/SKILL.md
@@ -26,6 +26,8 @@ Then work through the **Integration checklist** below
 
 ## Integration checklist
 
+A pipeline in Diffusers (be it standard or modular) will have multiple components. These components can be models, schedulers, processors, etc.
+
 - [ ] **Transformer model**
   - [ ] Implement the model with `from_pretrained` support (conventions: [models.md](../../models.md))
   - [ ] Convert weights (see **Weight / Checkpoint Conversion**)
@@ -76,7 +78,7 @@ Convert the original checkpoint into diffusers format with a standalone script u
 
 All weights load through the standard paths — `from_pretrained`, or `from_single_file` (add `FromSingleFileMixin` + a weight-mapping) for an original-format single checkpoint. No custom `from_pretrained`, no manual runtime loading. See the loading rule in [models.md](../../models.md).
 
-Common conversion patterns to watch for:
+Common conversion patterns to watch for model-level components:
 - Fused QKV weights that need splitting into separate Q, K, V
 - Scale/shift ordering differences (reference stores `[shift, scale]`, diffusers expects `[scale, shift]`)
 - Weight transpositions (linear stored as transposed conv, or vice versa)

From 67a8e4ae737ddd00cc1a434ff01684d44ee438b7 Mon Sep 17 00:00:00 2001
From: yiyixuxu <yixu310@gmail.com>
Date: Thu, 11 Jun 2026 18:24:12 +0000
Subject: [PATCH 9/9] [.ai] fix dangling pitfalls ref and broaden self-review
 scope
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Drop the broken 'pitfalls.md #10' reference in the conversion step (the /tmp
  model-storage pitfall was removed); save to a local path instead.
- Self-review now reviews the whole diff, not just src/diffusers/ and .ai/ — a
  contributor should review their own tests/docs/scripts too (the CI's scoping is
  a safety measure for untrusted PRs). Reword to 'same rubric as the CI'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .ai/skills/model-integration/SKILL.md |  2 +-
 .ai/skills/self-review/SKILL.md       | 22 +++++++++++-----------
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/.ai/skills/model-integration/SKILL.md b/.ai/skills/model-integration/SKILL.md
index 8bc1a52fd696..18f092a47219 100644
--- a/.ai/skills/model-integration/SKILL.md
+++ b/.ai/skills/model-integration/SKILL.md
@@ -74,7 +74,7 @@ Convert the original checkpoint into diffusers format with a standalone script u
 
 1. Map the original state-dict keys to the diffusers module names (renames + any tensor surgery — see patterns below).
 2. Instantiate the diffusers model from its config and load the converted state dict.
-3. `save_pretrained(...)` to a **persistent** path (never `/tmp/` — see [pitfalls.md](pitfalls.md) #10), then load it back with `from_pretrained` to confirm it round-trips.
+3. `save_pretrained(...)` to a local path, then load it back with `from_pretrained` to confirm it round-trips.
 
 All weights load through the standard paths — `from_pretrained`, or `from_single_file` (add `FromSingleFileMixin` + a weight-mapping) for an original-format single checkpoint. No custom `from_pretrained`, no manual runtime loading. See the loading rule in [models.md](../../models.md).
 
diff --git a/.ai/skills/self-review/SKILL.md b/.ai/skills/self-review/SKILL.md
index f9bc4545839e..cfc45e54ebb0 100644
--- a/.ai/skills/self-review/SKILL.md
+++ b/.ai/skills/self-review/SKILL.md
@@ -2,19 +2,19 @@
 name: self-review
 description: >
   Use before opening a PR, or whenever asked to self-review a diffusers
-  contribution. Runs the same review the `@claude` CI runs: checks the diff
-  against .ai/review-rules.md, traces call paths for dead code, and reports
-  findings grouped by severity — flagging what to fix before submitting (blocking
-  issues + dead code) vs what to leave for the actual review. Report-only — does
-  not edit files.
+  contribution. Applies the same rubric as the `@claude` CI (checks the diff
+  against .ai/review-rules.md, traces call paths for dead code). Reports findings grouped by
+  severity, flagging what to fix before submitting (blocking issues + dead code)
+  vs what to leave for the actual review. Report-only — does not edit files.
 ---
 
 # Self-review
 
-The same pass the `@claude` CI reviewer runs, so you can catch issues before a
-maintainer does. You're already on the branch with the conventions loaded, so:
-get the diff → review it against the rubric → report. **Review only changes under
-`src/diffusers/` and `.ai/`** (skip everything else, like the CI does).
+Runs the same rubric as the `@claude` CI reviewer, so you catch issues before a
+maintainer does — but over your **whole** PR diff. (The CI scopes itself to
+`src/diffusers/` and `.ai/`; for your own PR, also review your tests, docs, and
+scripts.) You're already on the branch with the conventions loaded, so: get the
+diff → review it against the rubric → report.
 
 ## 1. Get the diff
 
@@ -44,5 +44,5 @@ touched, also read `.ai/models.md`, `.ai/pipelines.md`, or `.ai/modular.md`.
   - **Leave for the actual review** — non-blocking issues that aren't obviously
     correct; raise these with the reviewer rather than guessing at them now.
 
-Report only — do not edit files. Be concrete, cite the rule, review only the diff
-under `src/diffusers/` and `.ai/`, and don't invent issues or flag pure style.
+Report only — do not edit files. Be concrete, cite the rule, review the whole
+diff, and don't invent issues or flag pure style.