From 9e4cde3be4b64e09294be6972ce577039da16a87 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 04:11:01 +0900
Subject: [PATCH 1/7] =?UTF-8?q?docs(perf):=20PR6=20design=20spec=20?=
 =?UTF-8?q?=E2=80=94=20i8=20hot-path=20measure=20+=20=CE=B5/recall=20safet?=
 =?UTF-8?q?y=20net=20(LOC-64)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Brainstorming output for the next work-stream: apply the PR1 "measure
before changing" pattern to the SHIPPED hot path (i8), which every prior
review/bench scrutinized only on the f32 fallback. Non-destructive —
kernels unchanged. Captures i8 bench baseline + numeric ε parity + a
recall@k quantization-quality gate, all fail-closed in CI on the shipped
faer+quant compile tree. Implementation follows via writing-plans.
---
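Note (not part of the commit message): the ε rationale in §4 of the spec
relies on the i32 accumulators never overflowing at dim 1536. The arithmetic
behind that claim can be sketched as below. This is a minimal sketch, assuming
quantized values may span the full i8 range [-128, 127]; the quantizer's
actual clamp is not shown in this patch, and the names `worst_case_i8_sum`
and `fits_in_i32` are hypothetical helpers, not functions in the repo.

```rust
/// Worst-case magnitude of a sum of `dim` products of two i8 values.
/// Assumption: every element may take the extreme value -128; the real
/// quantizer may clamp to a narrower range, which only lowers this bound.
fn worst_case_i8_sum(dim: usize) -> i64 {
    let extreme = i8::MIN as i64; // -128
    (dim as i64) * extreme * extreme
}

/// True when the worst-case sum still fits in an i32 accumulator.
fn fits_in_i32(dim: usize) -> bool {
    worst_case_i8_sum(dim) <= i32::MAX as i64
}
```

Under that assumption the bound at dim 1536 is about 2.5e7, matching the
"~2.5e7 ≪ 2.1e9" figure in the spec, and an i32 accumulator only becomes
unsafe at dimensions around 131,072.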
 .../PR6-spec-i8-measure-parity-net.md         | 111 ++++++++++++++++++
 1 file changed, 111 insertions(+)
 create mode 100644 docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md

diff --git a/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
new file mode 100644
index 0000000..7404db5
--- /dev/null
+++ b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
@@ -0,0 +1,111 @@
+# PR6 design spec: i8 shipped hot-path measurement + ε/recall safety net [measure first]
+
+- Written: 2026-05-31
+- Status: 📝 Design (brainstorming output). After approval, the implementation plan is written with writing-plans.
+- Linear: [LOC-64](https://linear.app/loceract/issue/LOC-64)
+- Branch: `feat/loc-64-i8-measure-parity-net`, base = `main` (PR #67 merged, `1217123`). Stack trap avoided.
+- Approach: **A: "PR1 replay, extended to i8"**
+
+## 1. Background / Why (Problem)
+
+A code review in this session established that **the per-candidate hot path of the shipped build (`vector_faer,vector_quant_i8`) is the i8 path** (`cosine_with_query_norm_i8_blob`), yet:
+
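+- Every review, bench (PR1 included), and faer/fused debate so far looked at the **f32 path**, which is only the **fallback** in the shipped build.
+- The i8 hot kernel that actually ships has **zero microbenchmarks** (`benches/vector_math.rs` covers only f32 dot/l2/cosine/decode).
+- **No test verifies the retrieval quality (ranking/recall)** of i8 quantization. The existing i8 tests ([vector_quant.rs:129-200](../../../rust_builder/rust/src/api/vector_quant.rs)) cover only (a) quantize↔dequantize round-trip error `<0.05`, (b) a coarse direction sanity check (`>0.9`/`<-0.9`), and (c) blob↔slice entry-point agreement (`<1e-6`). **Whether quantization reorders the nearest-neighbor top-k is unverified.**
+
+So the PR1 principle, "freeze the current state before changing any kernel", is applied this time to the **shipped hot path (i8)**. Once this PR merges, CI will block, on numeric grounds, any future i8 change that breaks retrieval quality.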
+
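+## 2. Non-goals
+
+- **Zero lines of kernel/quantization code change.** This PR is measurement + safety net only. i8 optimization goes in a separate PR on top of this net.
+- Not a re-measurement or redesign of the f32 path (done in PR1; the decision to keep faer is final).
+- Not an on-device benchmark (separate, optional work).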
+
+## 3. Component 1: Measurement (bench)
+
+- Expose the i8 surface in `src/bench_api.rs` (`#[cfg(feature = "bench")]`, same pattern as the existing f32 exposure): `quantize_f32_to_i8`, `l2_norm_i8`, `cosine_with_query_norm_i8_blob`, `i8_blob_from_slice`. (The target functions are already `pub`, so no visibility change is needed, only a re-export.)
+- Targets added to `benches/vector_math.rs`:
+  - `bench_cosine_i8[dim]`: i8 cosine microbench for each of `DIMS` (384/768/1024/1536).
+  - `bench_scan_i8`: i8-blob scan of 1 query vs `SCAN_N` (2000) candidates (mimics the shipped hot loop), side by side with the existing f32 `exact_scan`.
+- Running:
+  - `cargo bench --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_quant_i8"` (i8 hot kernel)
+  - Existing f32: `--features "bench,vector_faer"` (comparison baseline)
+  - Distinguished by the `bench_api::BACKEND` label.
+- **Journal record**: freeze the i8 hot-kernel throughput (per dimension) and the i8 vs f32-faer scan ratio in `PR6.md`.
+- What we gain: closes the "zero numbers for the shipped hot kernel" gap and quantifies how much i8 actually gains over f32.
+
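+## 4. Component 2: Numeric ε net (kernel correctness)
+
+- Location: the `vector_quant.rs` test module (next to the existing i8 tests), `#[cfg(feature = "vector_quant_i8")]`.
+- Model: PR1's [`faer_parity_tests`](../../../rust_builder/rust/src/api/vector_math.rs#L208) (kernel ≈ independent reference, agreement within ε).
+- Assertion: `cosine_with_query_norm_i8_blob` (kernel) ≈ an **independent reference reimplementation**, within `ε = 1e-4` for each dimension.
+  - Independent reference: for the same i8 input, accumulate dot and sq_sum in **f64**, then `sqrt`/divide, so both the accumulation width and the implementation differ from the kernel.
+- **Rationale for ε**: the kernel's `dot_i8_i32`/`sq_sum` accumulate in i32 integers and are therefore **exact** (at dim 1536 the max is ~2.5e7 ≪ i32 max 2.1e9, no overflow). The only floating-point error sources are the final `(sq_sum as f32).sqrt()` and the division by `query_norm` (f32). `1e-4` is a reasonable bound that tolerates cross-platform differences in this cast while catching logic bugs (SIMD rewrites, indexing, norm errors).
+- Difference from the existing tests: they check only blob↔slice **entry-point agreement**. This one checks the *math itself* against an independent implementation, so it catches **bugs in future i8 kernel rewrites**.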
+
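+## 5. Component 3: recall@k net (quantization quality) ★core
+
+- Location: the `vector_quant.rs` test module, `#[cfg(feature = "vector_quant_i8")]`.
+- **Synthetic clustered corpus** (deterministic, no rand dependency, `pseudo_vec`-style seeded generator):
+  - C cluster centers (unit vectors) + Gaussian-like noise → normalize. Mimics the embedding distribution where "a few are near, most are far".
+  - Defaults: **N=2000** candidates (= reuses `SCAN_N`), **Q=32** queries, **dim=768**, **k=10** (recall@10, top 0.5%).
+- **Ground truth**: each query's **f32 cosine top-k** (on the f32 corpus = the true ranking).
+- **i8 ranking**: quantize corpus and queries to i8, then take the **i8 cosine top-k**.
+- **Total-order comparator (required)**: sort **both** i8 and f32 by the total order `(score descending, index ascending)`. i8 scores derive from i32 integer sums, so **ties occur in bulk** (a consequence of the kernel's accumulation structure); without an explicitly enforced index tie-break, differences between platform (Ubuntu vs macOS) sort implementations make the test **flaky**. With the same comparator on both sides, only genuine quantization reordering is reflected in recall.
+- **Metric**: `recall@k = |topk_i8 ∩ topk_f32| / k`, averaged over queries. Fixed seeds → **fully reproducible (non-statistical, non-flaky)**.
+- **The threshold is a product of measuring first**:
+  1. The first run of the implementation measures the actual `recall@10`.
+  2. **Saturation check**: if the measurement is ~1.0 the gate is decorative → raise cluster density/noise/overlap and recalibrate the corpus until recall falls in the **sensitive band (0.85~0.98)**.
+  3. Fix the CI gate at `recall@10 ≥ FLOOR` (FLOOR = measured value − margin, with margin ≈ 0.03).
+  4. Record the measured value, FLOOR, and corpus parameters in `PR6.md`.
+- What we gain: CI turns red if quantization breaks nearest-neighbor ordering, which is the safety net missing today. The measurement produces the threshold, and the threshold becomes the regression gate.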
+
+## 6. Component 4: CI gating (fail-closed)
+
+- Add to `scripts/test_ci.sh`: `cargo test --lib --features "vector_quant_i8,vector_faer" -- --test-threads=1`
+  - **Matches the shipped compile tree (faer+quant) 100%**, so macro/compile conflicts between features are also caught early in CI.
+  - Like PR2's faer step, **requires ≥1 passing test (fail-closed)**: 0 passed (nothing collected) is treated as a failure.
+  - Keep `--test-threads=1` (the [[project_rust_tests_need_single_thread]] convention).
+- PR2 already **builds** the shipped build with `vector_faer,vector_quant_i8` → adding an **i8 test run** here closes the N2-style blind spot at the source.
+
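+## 7. Tracking
+
+- New Linear issue: **"PR6: i8 shipped hot-path measurement + ε/recall safety net [measure first]"** (under the project, priority High: directly tied to shipped retrieval quality).
+- New `docs/perf/vector-math-refactor/PR6.md` (journal template: results Before→After, feedback, risk/rollback, decision log).
+- Add a PR6 row to the README PR status table and link it to the i8 verification item in RETRO §5 "Next work".
+- ✅ **Merge order**: #67 (closeout) is merged (`1217123`) → this branch forks from the current `main`, so there is no README PR status table conflict and no stack trap.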
+
+## 8. Acceptance criteria
+
+- [ ] i8 microbench + i8 scan bench run, with numbers recorded in `PR6.md` (i8 throughput + i8 vs f32-faer ratio).
+- [ ] Numeric ε net: per-dimension i8 kernel ≈ f64 reference within `<1e-4`, green.
+- [ ] recall@k net: deterministic, corpus sits in the sensitive band (0.85~0.98), `recall@10 ≥ FLOOR` green, FLOOR/measured value recorded.
+- [ ] CI runs the i8 tests fail-closed with `--features "vector_quant_i8,vector_faer" -- --test-threads=1`, with no regression in existing jobs.
+- [ ] Confirmed: zero lines of kernel/quantization code changed (non-destructive).
+
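+## 9. Risks / Mitigations
+
+| Risk | Mitigation |
+|---|---|
+| recall gate saturates (false reassurance) | §5-2 saturation check + corpus sensitivity calibration, then fix FLOOR |
+| Cross-platform flakiness from i8 ties | §5 total-order comparator `(score, index)` enforced on both sides |
+| Feature-combination compile conflict | §6 CI tests on the shipped tree (faer+quant) |
+| README PR table conflict with #67 | §7 branch after #67 merges (stack avoided) |
+| ε too tight/too loose | Integer dot is exact, only sqrt/div introduce error → 1e-4 is mathematically reasonable (confirmed in review) |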
+
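+## 10. Tunable defaults (stated in the spec, adjustable during implementation)
+
+| Parameter | Default | Notes |
+|---|---|---|
+| ε (numeric net) | `1e-4` | Integer dot is exact; only sqrt/div error is allowed |
+| k (recall) | `10` | recall@10 |
+| N (corpus) | `2000` | Reuses `SCAN_N`, top 0.5% |
+| Q (queries) | `32` | |
+| dim | `768` | Representative shipped embedding dimension |
+| recall FLOOR (margin 0.03) | `measured − 0.03` | FLOOR fixed after the first measurement |
+| Cluster count/noise | Calibrated by measurement | Targets the recall 0.85~0.98 sensitive band |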
+
+---
+
+Implementation phase: after approval, write a step-by-step implementation plan with the **writing-plans** skill → execute on the `feat/loc-64-i8-measure-parity-net` branch.

From 00ecf21e17b758c8896b9f7ac5410773bbfddc36 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 04:45:48 +0900
Subject: [PATCH 2/7] docs(perf): PR6 implementation plan + spec (recall-floor
 + cosine-fidelity, measured baselines) (LOC-64)

Plan and spec for the i8 measure-first + safety-net work, finalized after an
adversarial pre-flight that (a) compiled the test/bench code on the shipped
vector_quant_i8,vector_faer tree and (b) measured real baselines:
recall@10 = 0.996875 (i8@768 is ~lossless), max|cosine_i8 - cosine_f32_true|
= 0.00121. Net 2 redesigned to recall floor (>= 0.98) + deterministic
cosine-fidelity backstop (<= 0.005), f64 ground truth (kills x86/ARM ULP
jitter), const-assert + CI name guards against vacuous gates.
---
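Note (not part of the commit message): the recall floor quoted above follows
from simple arithmetic on the measured baseline. The sketch below reproduces
it under stated assumptions: a 0.02 margin rounded to two decimals, and
Q * K = 32 * 10 = 320 total top-k slots. The helper names `derive_floor` and
`slack_hits` are hypothetical and do not exist in the repo.

```rust
/// Floor = (measured - margin), rounded to two decimals.
/// Hypothetical helper illustrating how 0.996875 becomes 0.98.
fn derive_floor(measured: f64, margin: f64) -> f64 {
    ((measured - margin) * 100.0).round() / 100.0
}

/// Number of whole top-k hits that can be lost before the gate trips,
/// given `total` = queries * k slots.
fn slack_hits(measured: f64, floor: f64, total: usize) -> usize {
    ((measured - floor) * total as f64).floor() as usize
}
```

With measured = 0.996875 (319 of 320 hits), this gives a floor of 0.98 and a
slack of 5 hits, so the gate fails once a change costs six or more additional
top-10 hits across the 32 queries.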
 .../PR6-plan-i8-measure-parity-net.md         | 654 ++++++++++++++++++
 .../PR6-spec-i8-measure-parity-net.md         |  57 +-
 2 files changed, 686 insertions(+), 25 deletions(-)
 create mode 100644 docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md

diff --git a/docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md b/docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md
new file mode 100644
index 0000000..e7adc22
--- /dev/null
+++ b/docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md
@@ -0,0 +1,654 @@
+# PR6: i8 Shipped Hot-Path Measurement + ε/recall Safety Net Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Lock the shipped int8 retrieval hot path with a benchmark baseline + a numeric ε kernel-parity net + two quantization-quality gates (recall@k floor + cosine fidelity), all fail-closed in CI — without changing any kernel.
+
+**Architecture:** Non-destructive "measure before changing" PR (PR1 pattern, applied to the i8 path). Add tests to the already-`vector_quant_i8`-gated `vector_quant.rs` test module, add i8 benches via the `bench` re-export surface, and wire a fail-closed i8 test step into `scripts/test_ci.sh` on the shipped `vector_faer,vector_quant_i8` compile tree.
+
+**Tech Stack:** Rust, criterion (dev-dep, `bench` feature), cargo feature flags (`vector_faer`, `vector_quant_i8`, `bench`), bash CI script.
+
+**Spec:** [PR6-spec-i8-measure-parity-net.md](PR6-spec-i8-measure-parity-net.md) · **Linear:** [LOC-64](https://linear.app/loceract/issue/LOC-64) · **Branch:** `feat/loc-64-i8-measure-parity-net` (already created off `main` @ `1217123`).
+
+**Verification note (why Task 2 looks the way it does):** an adversarial pre-flight (running the real kernel) found i8 per-vector quantization at dim 768 is too accurate to reorder a top-10 — recall@10 ≈ 0.997 and cannot be pushed into a "sensitive band" without abandoning the shipped settings. So the quality gate **locks that high baseline** (`recall ≥ baseline − margin`) and adds a genuinely sensitive, fully-deterministic **cosine-fidelity** backstop. Ground-truth cosine is computed in **f64** so the recall boundary can't flip on x86-vs-ARM ULP jitter.
+
+**Conventions (this repo):**
+- Rust tests run with `-- --test-threads=1` (repo convention: tests share SQLite and are not safe to run in parallel).
+- Commits authored solely by the user — **NO** `Co-Authored-By` / Claude footer.
+- Open PR, stop at CI green; **user merges**.
+- All `cargo` commands use `--manifest-path rust_builder/rust/Cargo.toml`.
+
+---
+
+## File Structure
+
+| File | Change | Responsibility |
+|---|---|---|
+| `rust_builder/rust/src/api/vector_quant.rs` | Modify (`mod tests` only) | ε kernel-parity net + recall@k floor + cosine-fidelity net + shared deterministic test helpers. **No non-test code touched.** |
+| `rust_builder/rust/src/bench_api.rs` | Modify | `#[cfg(feature="vector_quant_i8")]` i8 re-export wrappers for the bench crate. |
+| `rust_builder/rust/benches/vector_math.rs` | Modify | `bench_cosine_i8` + `bench_scan_i8` (cfg-stubbed when feature off) + targets list. |
+| `scripts/test_ci.sh` | Modify (`native` case) | Fail-closed i8 test run on the shipped `vector_faer,vector_quant_i8` tree, with per-net name guards. |
+| `docs/perf/vector-math-refactor/PR6.md` | Create | Journal entry: bench numbers, recall baseline/FLOOR, fidelity bound, decisions. |
+| `docs/perf/vector-math-refactor/README.md` | Modify | Add PR6 row to the status table. |
+
+---
+
+## Task 1: Numeric ε net (i8 kernel correctness)
+
+Mirror of [`faer_parity_tests`](../../../rust_builder/rust/src/api/vector_math.rs#L208): assert the shipped i8 cosine kernel agrees with an **independent f64 reference of the same i8 inputs** within a tight ε. The i8 dot and squared-norms are exact integer sums, so the only divergence is the final `sqrt` + division → `1e-4` catches logic/SIMD-rewrite bugs while tolerating the f32 cast.
+
+**Files:**
+- Modify: `rust_builder/rust/src/api/vector_quant.rs` (inside existing `#[cfg(test)] mod tests`, after line 197 / before the closing `}` at line 198)
+
+- [ ] **Step 1: Add shared deterministic test helpers + the ε test**
+
+Insert into `mod tests` (before its closing brace):
+
+```rust
+    // --- PR6 shared test helpers (deterministic, no rand dep) ---
+
+    // Same generator as benches/vector_math.rs: reproducible run-to-run.
+    fn pseudo_vec(dim: usize, seed: u32) -> Vec<f32> {
+        (0..dim)
+            .map(|i| {
+                let x = (i as u32)
+                    .wrapping_mul(2_654_435_761)
+                    .wrapping_add(seed.wrapping_mul(40_503));
+                ((x % 1000) as f32 / 1000.0) - 0.5
+            })
+            .collect()
+    }
+
+    // Independent f64 reference cosine of two i8 vectors. Different accumulation
+    // width (i64) and float precision (f64) than the i32->f32 kernel, so a match
+    // proves the kernel math, not just that it agrees with itself.
+    fn ref_cosine_i8_f64(q: &[i8], t: &[i8]) -> f64 {
+        if q.len() != t.len() || q.is_empty() {
+            return 0.0;
+        }
+        let mut dot: i64 = 0;
+        let mut qsq: i64 = 0;
+        let mut tsq: i64 = 0;
+        for (&a, &b) in q.iter().zip(t.iter()) {
+            dot += (a as i64) * (b as i64);
+            qsq += (a as i64) * (a as i64);
+            tsq += (b as i64) * (b as i64);
+        }
+        if qsq == 0 || tsq == 0 {
+            return 0.0;
+        }
+        (dot as f64) / ((qsq as f64).sqrt() * (tsq as f64).sqrt())
+    }
+
+    #[test]
+    fn i8_blob_cosine_matches_independent_reference() {
+        // Integer dot/sq are exact; only the final f32 sqrt+div can drift.
+        const EPS: f64 = 1e-4;
+        for &dim in &[1usize, 2, 3, 16, 384, 768, 1024, 1536] {
+            let q = pseudo_vec(dim, 7);
+            let t = pseudo_vec(dim, 9);
+            let (qi, _) = quantize_f32_to_i8(&q);
+            let (ti, _) = quantize_f32_to_i8(&t);
+            let blob = i8_blob_from_slice(&ti);
+            let qn = l2_norm_i8(&qi);
+
+            let kernel = cosine_with_query_norm_i8_blob(&qi, qn, &blob) as f64;
+            let reference = ref_cosine_i8_f64(&qi, &ti);
+            assert!(
+                (kernel - reference).abs() < EPS,
+                "i8 cosine dim={dim}: kernel={kernel} ref={reference}"
+            );
+        }
+    }
+```
+
+- [ ] **Step 2: Run the test — expect PASS (net green on current kernel)**
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features vector_quant_i8 i8_blob_cosine_matches_independent_reference -- --test-threads=1`
+Expected: `test result: ok. 1 passed`
+
+- [ ] **Step 3: Prove the net has teeth (temporary mutation → red → revert)**
+
+Temporarily change `EPS` to `1e-12` and re-run Step 2.
+Expected: FAIL (`kernel=... ref=...`) — confirms the assertion is live, not vacuous.
+Then **revert `EPS` back to `1e-4`** and re-run Step 2 → PASS.
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add rust_builder/rust/src/api/vector_quant.rs
+git commit -m "test(vector_quant): i8 cosine kernel ε-parity vs independent f64 reference (LOC-64)"
+```
+
+---
+
+## Task 2: Quantization-quality nets — recall@k floor + cosine fidelity (measure-first)
+
+Two complementary nets that reflect the **shipped path** (768-dim, per-vector scale):
+1. **recall@k floor** — top-k(i8) vs top-k(f32-true) overlap, gated `≥ measured baseline − margin`. (Baseline ~0.997; we lock the real high quality, we do NOT force an artificial band.)
+2. **cosine fidelity** — `max|cosine_i8 − cosine_f32_true| ≤ ε_q`. Fully deterministic (no ranking/boundary), the genuinely sensitive gate against a lossier future quantizer.
+
+Ground-truth cosine is f64 (kills x86/ARM boundary jitter). `const _` guards prevent shipping a vacuous threshold if calibration is skipped.
+
+**Files:**
+- Modify: `rust_builder/rust/src/api/vector_quant.rs` (same `mod tests`, append after Task 1's helpers)
+
+- [ ] **Step 1: Add corpus generator + generic comparator + f64 reference**
+
+Append into `mod tests`:
+
+```rust
+    fn normalize(v: &mut [f32]) {
+        let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
+        if n > 0.0 {
+            for x in v.iter_mut() {
+                *x /= n;
+            }
+        }
+    }
+
+    fn det_unit(dim: usize, seed: u32) -> Vec<f32> {
+        let mut v = pseudo_vec(dim, seed);
+        normalize(&mut v);
+        v
+    }
+
+    // Clustered corpus: vector i belongs to cluster (i % clusters); a weighted
+    // blend of that cluster's center and per-vector noise, normalized. Realistic
+    // "few near, most far" structure (not a sensitivity knob — see verification
+    // note: i8@768 stays ~0.997 regardless; we lock that, not a forced band).
+    fn clustered_corpus(
+        n: usize,
+        dim: usize,
+        clusters: usize,
+        weight: f32,
+        seed0: u32,
+    ) -> Vec<Vec<f32>> {
+        let centers: Vec<Vec<f32>> =
+            (0..clusters).map(|c| det_unit(dim, 1_000 + c as u32)).collect();
+        (0..n)
+            .map(|i| {
+                let c = i % clusters;
+                let noise = pseudo_vec(dim, seed0 + i as u32);
+                let mut v: Vec<f32> = centers[c]
+                    .iter()
+                    .zip(noise.iter())
+                    .map(|(&ce, &no)| weight * ce + (1.0 - weight) * no)
+                    .collect();
+                normalize(&mut v);
+                v
+            })
+            .collect()
+    }
+
+    // Total order: score descending, then index ascending. Deterministic ties.
+    // Generic so it serves both the f64 ground truth and the f32 i8 ranking.
+    fn order_desc<T: PartialOrd>(a: &(usize, T), b: &(usize, T)) -> std::cmp::Ordering {
+        b.1.partial_cmp(&a.1)
+            .unwrap_or(std::cmp::Ordering::Equal)
+            .then(a.0.cmp(&b.0))
+    }
+
+    // True cosine of the ORIGINAL f32 vectors, accumulated in f64. f64 makes the
+    // top-k boundary gap >> any x86-vs-ARM f32 ULP jitter, so the recall ranking
+    // is cross-platform stable; also the reference for cosine fidelity.
+    fn cosine_f64_true(q: &[f32], t: &[f32]) -> f64 {
+        let mut dot = 0.0f64;
+        let mut qsq = 0.0f64;
+        let mut tsq = 0.0f64;
+        for (a, b) in q.iter().zip(t.iter()) {
+            let (a, b) = (*a as f64, *b as f64);
+            dot += a * b;
+            qsq += a * a;
+            tsq += b * b;
+        }
+        if qsq == 0.0 || tsq == 0.0 {
+            0.0
+        } else {
+            dot / (qsq.sqrt() * tsq.sqrt())
+        }
+    }
+```
+
+- [ ] **Step 2: Add the recall@k floor test (f64 ground truth)**
+
+Append into `mod tests`:
+
+```rust
+    #[test]
+    fn i8_topk_recall_matches_f32_within_floor() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const K: usize = 10;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from the measured baseline recall@10 = 0.996875 (deterministic:
+        // f64 GT + integer-exact i8 => bit-identical across x86/ARM). FLOOR =
+        // round(0.9969 - 0.02, 2dp) = 0.98, margin ~0.017 (~5 hits of 320). The const
+        // guard forbids a vacuous (<0.5) floor. Confirm in Step 4.
+        const MIN_RECALL: f32 = 0.98;
+        const _: () = assert!(MIN_RECALL >= 0.5, "MIN_RECALL must be a real floor");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut recall_sum = 0.0f32;
+        for query in &queries {
+            // f64 ground-truth top-K (f64 removes x86/ARM ULP boundary jitter).
+            let mut gt_scores: Vec<(usize, f64)> = corpus
+                .iter()
+                .enumerate()
+                .map(|(i, c)| (i, cosine_f64_true(query, c)))
+                .collect();
+            gt_scores.sort_by(order_desc);
+            let gt: std::collections::HashSet<usize> =
+                gt_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            // i8 top-K (shipped kernel) with the identical total order.
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            let mut i8_scores: Vec<(usize, f32)> = corpus_blob
+                .iter()
+                .enumerate()
+                .map(|(i, blob)| (i, cosine_with_query_norm_i8_blob(&qi, qn_i8, blob)))
+                .collect();
+            i8_scores.sort_by(order_desc);
+            let got: std::collections::HashSet<usize> =
+                i8_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            recall_sum += gt.intersection(&got).count() as f32 / K as f32;
+        }
+        let recall = recall_sum / Q as f32;
+        println!("PR6 recall@{K} (N={N} Q={Q} dim={DIM} clusters={CLUSTERS}) = {recall}");
+        assert!(
+            recall >= MIN_RECALL,
+            "i8 recall@{K} regressed: {recall} < {MIN_RECALL}"
+        );
+    }
+```
+
+- [ ] **Step 3: Add the cosine-fidelity backstop test (deterministic, sensitive)**
+
+Append into `mod tests`:
+
+```rust
+    #[test]
+    fn i8_cosine_fidelity_vs_true_f32() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from the measured max error 0.00121 (deterministic: i8 dot
+        // integer-exact, GT in f64 => ~1e-12 platform jitter). 0.005 ~= 4x the
+        // baseline: sensitive to a lossier future quantizer yet never flaky.
+        // The const guard forbids a vacuous (>=0.1) bound. Confirm in Step 4.
+        const MAX_COS_ERR: f64 = 0.005;
+        const _: () = assert!(MAX_COS_ERR < 0.1, "MAX_COS_ERR must be a real bound");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut max_err = 0.0f64;
+        for query in &queries {
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            for (c, blob) in corpus.iter().zip(corpus_blob.iter()) {
+                let i8c = cosine_with_query_norm_i8_blob(&qi, qn_i8, blob) as f64;
+                let truec = cosine_f64_true(query, c);
+                let e = (i8c - truec).abs();
+                if e > max_err {
+                    max_err = e;
+                }
+            }
+        }
+        println!("PR6 max|cosine_i8 - cosine_f32_true| (N={N} Q={Q} dim={DIM}) = {max_err}");
+        assert!(
+            max_err <= MAX_COS_ERR,
+            "i8 cosine fidelity regressed: max err {max_err} > {MAX_COS_ERR}"
+        );
+    }
+```
+
+- [ ] **Step 4: Run & confirm the (pre-measured, deterministic) baselines**
+
+The thresholds above are already locked from an empirical planning-time run (macOS arm64). Confirm they hold — the metrics are deterministic (f64 GT + integer-exact i8), so they should match bit-for-bit:
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1 --nocapture`
+Expected: all `vector_quant` tests PASS and print:
+- `PR6 recall@10 (...) = 0.996875` (gate `MIN_RECALL=0.98` → pass, margin ~0.017)
+- `PR6 max|cosine_i8 - cosine_f32_true| (...) = 0.00121...` (gate `MAX_COS_ERR=0.005` → pass, ~4× margin)
+
+NOTE: `cargo test` takes ONE positional substring filter — `"a|b|c"` matches literally (0 tests). Use the module substring `vector_quant` (runs all 7) as above.
+
+If your measured recall `X` or max error `M` differs materially (it shouldn't, the metrics are deterministic), recompute `MIN_RECALL = round(X − 0.02)` to 2 dp and `MAX_COS_ERR ≈ 4 × M` (keep the const guards satisfied) and note the deviation in PR6.md.
+
+- [ ] **Step 5: Prove both gates have teeth**
+
+Temporarily set `MIN_RECALL = 0.999` → recall test FAILs (0.996875 < 0.999); revert to `0.98`.
+Temporarily set `MAX_COS_ERR = 1e-9` → fidelity test FAILs; revert to `0.005`.
+Re-run Step 4 → both PASS.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add rust_builder/rust/src/api/vector_quant.rs
+git commit -m "test(vector_quant): i8 recall@10 floor + cosine-fidelity gates vs f64 truth (LOC-64)"
+```
+
+---
+
+## Task 3: i8 microbench + scan bench
+
+Expose the i8 kernel to the bench crate and add an i8 microbench + an i8 scan bench (shipped hot loop). When `vector_quant_i8` is off, the i8 bench fns compile as no-op stubs so `criterion_group!` is feature-agnostic.
+
+**Files:**
+- Modify: `rust_builder/rust/src/bench_api.rs` (append after line 32, before the `BACKEND` doc comment)
+- Modify: `rust_builder/rust/benches/vector_math.rs` (add fns + extend `targets`)
+
+- [ ] **Step 1: Add i8 re-export wrappers to `bench_api.rs`**
+
+Insert after line 32 (`}` of `decode_f32_embedding`), before the `BACKEND` doc comment:
+
+```rust
+#[cfg(feature = "vector_quant_i8")]
+use crate::api::vector_quant;
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn quantize_f32_to_i8(input: &[f32]) -> (Vec<i8>, f32) {
+    vector_quant::quantize_f32_to_i8(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn l2_norm_i8(v: &[i8]) -> f32 {
+    vector_quant::l2_norm_i8(v)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn i8_blob_from_slice(input: &[i8]) -> Vec<u8> {
+    vector_quant::i8_blob_from_slice(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn cosine_with_query_norm_i8_blob(query: &[i8], query_norm: f32, target_blob: &[u8]) -> f32 {
+    vector_quant::cosine_with_query_norm_i8_blob(query, query_norm, target_blob)
+}
+```
+
+- [ ] **Step 2: Verify `bench_api` compiles under the i8 feature**
+
+Run: `cargo build --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_quant_i8"`
+Expected: builds clean. (`api/mod.rs:29` declares `#[cfg(feature="vector_quant_i8")] pub(crate) mod vector_quant;` and the kernel fns are `pub`, so the crate-internal path `crate::api::vector_quant::*` resolves from `bench_api`. The `use` and wrappers share the same `vector_quant_i8` gate, so nothing dangles when the feature is off.)
+
+- [ ] **Step 3: Add i8 bench fns + extend targets in `benches/vector_math.rs`**
+
+Insert after `bench_scan` (line 107), before the `criterion_group!`:
+
+```rust
+#[cfg(feature = "vector_quant_i8")]
+fn bench_cosine_i8(c: &mut Criterion) {
+    let mut g = c.benchmark_group("cosine_i8");
+    for &dim in &DIMS {
+        let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 1));
+        let (ti, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 2));
+        let qn = bench_api::l2_norm_i8(&qi);
+        let tblob = bench_api::i8_blob_from_slice(&ti);
+        g.throughput(Throughput::Elements(dim as u64));
+        g.bench_with_input(BenchmarkId::from_parameter(dim), &dim, |b, _| {
+            b.iter(|| {
+                bench_api::cosine_with_query_norm_i8_blob(
+                    black_box(&qi),
+                    black_box(qn),
+                    black_box(&tblob),
+                )
+            })
+        });
+    }
+    g.finish();
+}
+#[cfg(not(feature = "vector_quant_i8"))]
+fn bench_cosine_i8(_c: &mut Criterion) {}
+
+// Shipped exact-scan inner loop: one query vs N candidate i8 blobs, scored with
+// zero f32 decode / zero per-row alloc — the actual release hot path.
+#[cfg(feature = "vector_quant_i8")]
+fn bench_scan_i8(c: &mut Criterion) {
+    let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 1));
+    let qn = bench_api::l2_norm_i8(&qi);
+    let blobs: Vec<Vec<u8>> = (0..SCAN_N)
+        .map(|i| {
+            let (vi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 100 + i as u32));
+            bench_api::i8_blob_from_slice(&vi)
+        })
+        .collect();
+
+    let mut g = c.benchmark_group("exact_scan_i8");
+    g.throughput(Throughput::Elements(SCAN_N as u64));
+    g.bench_function(BenchmarkId::new("i8_blob_cosine", SCAN_N), |b| {
+        b.iter(|| {
+            let mut best = f32::MIN;
+            for blob in &blobs {
+                let s = bench_api::cosine_with_query_norm_i8_blob(black_box(&qi), qn, black_box(blob));
+                if s > best {
+                    best = s;
+                }
+            }
+            black_box(best)
+        })
+    });
+    g.finish();
+}
+#[cfg(not(feature = "vector_quant_i8"))]
+fn bench_scan_i8(_c: &mut Criterion) {}
+```
+
+Then change the `criterion_group!` `targets` line (line 115) from:
+
+```rust
+    targets = bench_cosine, bench_dot, bench_decode, bench_scan
+```
+
+to:
+
+```rust
+    targets = bench_cosine, bench_dot, bench_decode, bench_scan, bench_cosine_i8, bench_scan_i8
+```
+
+- [ ] **Step 4: Run the shipped-tree bench (i8 + f32-faer side by side)**
+
+Run: `cargo bench --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_faer,vector_quant_i8" -- exact_scan`
+Expected: reports both `exact_scan[faer]` (f32 decode+cosine) and `exact_scan_i8/i8_blob_cosine` (shipped i8). Record both medians + the i8/f32 ratio. (Group names `cosine_i8`/`exact_scan_i8` carry the `_i8` suffix to distinguish from the f32 groups, which is the §3 "distinguish f32 vs i8" intent.)
+
+Also run the i8 microbench: `cargo bench --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_faer,vector_quant_i8" -- cosine_i8` and record per-dim numbers.
+
+- [ ] **Step 5: Verify the no-op stubs compile with the feature OFF**
+
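+Run: `cargo bench --manifest-path rust_builder/rust/Cargo.toml --features "bench" --no-run`
+Expected: compiles clean (i8 bench fns are no-op stubs; `criterion_group!` still references them). `--no-run` builds the bench target without executing it; plain `cargo build` skips bench targets, so it would not exercise the stubs.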
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add rust_builder/rust/src/bench_api.rs rust_builder/rust/benches/vector_math.rs
+git commit -m "bench(vector_math): add i8 hot-kernel + i8 scan benches (LOC-64)"
+```
+
+---
+
+## Task 4: CI fail-closed gate on the shipped i8 tree
+
+Add an i8 test step to `scripts/test_ci.sh`, mirroring the existing faer step: run the `vector_quant` tests (ε + recall + fidelity nets) on the **shipped** `vector_faer,vector_quant_i8` tree. Fail closed on zero matches AND if any **named** net is missing (a broad-filter + N≥1 guard alone would stay green on the 4 legacy tests if a net were renamed/cfg-excluded).
+
+**Files:**
+- Modify: `scripts/test_ci.sh` (`native` case, after the faer `vector_math` block ending at line 50, before the `# Compile-check the actual shipped feature combo` comment at line 51)
+
+- [ ] **Step 1: Insert the i8 test step**
+
+After line 50 (the faer block's closing `fi`), before line 51's comment, insert:
+
+```bash
+    echo "[ci] Running i8 quant kernels + ε/recall/fidelity safety nets on the SHIPPED faer+quant tree"
+    # The shipped per-candidate hot path is i8 (cosine_with_query_norm_i8_blob),
+    # not the f32 faer kernels. Run the vector_quant tests on the exact shipped
+    # feature combo and fail closed on zero matches.
+    if ! quant_out="$(cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1 2>&1)"; then
+      echo "$quant_out"
+      echo "[ci] ERROR: i8 vector_quant tests failed" >&2
+      exit 1
+    fi
+    echo "$quant_out"
+    if ! grep -Eq 'test result: ok\. [1-9][0-9]* passed' <<<"$quant_out"; then
+      echo "[ci] ERROR: i8 vector_quant matched 0 tests (renamed/cfg-excluded?); failing closed" >&2
+      exit 1
+    fi
+    # Fail closed if any specific safety net was renamed/cfg-excluded (a broad
+    # filter + N>=1 alone would stay green on the legacy i8 tests).
+    for net in i8_blob_cosine_matches_independent_reference \
+               i8_topk_recall_matches_f32_within_floor \
+               i8_cosine_fidelity_vs_true_f32; do
+      if ! grep -Eq "${net} .* ok" <<<"$quant_out"; then
+        echo "[ci] ERROR: i8 safety net '${net}' did not run/pass (renamed/cfg-excluded?); failing closed" >&2
+        exit 1
+      fi
+    done
+```
+
+- [ ] **Step 2: Run the inserted command directly (fast local check)**
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1`
+Expected: `test result: ok. N passed` with N ≥ 7 (4 legacy + ε + recall + fidelity = 7), and the output contains `... ok` lines for all three named nets.
+
+(Full `./scripts/test_ci.sh native` also runs flutter/PDF steps that need the local toolchain; if unavailable, the direct command above is the meaningful check for this task.)
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add scripts/test_ci.sh
+git commit -m "ci(vector_quant): run i8 ε/recall/fidelity nets on shipped faer+quant tree (LOC-64)"
+```
+
+---
+
+## Task 5: Journal — PR6.md + README status row
+
+**Files:**
+- Create: `docs/perf/vector-math-refactor/PR6.md`
+- Modify: `docs/perf/vector-math-refactor/README.md` (status table)
+
+- [ ] **Step 1: Create `PR6.md` with the measured results**
+
+Create `docs/perf/vector-math-refactor/PR6.md` (fill `<...>` from Task 2 Step 5 and Task 3 Step 4):
+
+```markdown
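+# PR6: i8 shipped hot-path measurement + ε/recall/fidelity safety nets (N: measure first)
+
+- Branch: `feat/loc-64-i8-measure-parity-net`
+- Linear: [LOC-64](https://linear.app/loceract/issue/LOC-64)
+- Status: 🟦 In progress (PR open, waiting for CI green)
+- Design: [PR6-spec-i8-measure-parity-net.md](PR6-spec-i8-measure-parity-net.md)
+
+## Scope (non-destructive: 0 lines of kernel/quantization code changed)
+Applies the PR1 pattern to the shipped hot path (i8 `cosine_with_query_norm_i8_blob`): measurement + numeric ε net + recall@k floor + cosine fidelity net + fail-closed CI.
+
+## Results (measured)
+- **i8 hot-kernel microbench** (per dim, ns): 384=<...> / 768=<...> / 1024=<...> / 1536=<...>
+- **Scan (2000×768) comparison**: `exact_scan[faer]` (f32 decode + cosine) = <...> µs vs `exact_scan_i8` (i8 blob) = <...> µs → i8 is **<...>×** relative to f32-faer.
+- **Numeric ε net**: kernel ≈ f64 reference at dims {1,2,3,16,384,768,1024,1536}, ε=1e-4, green.
+- **Key finding**: i8 per-vector quantization reaches **recall@10 ≈ 0.997** (= 319/320, nearly lossless) at 768d. The "sensitive band" is unreachable under the shipped configuration, and forcing it would make the corpus unrepresentative, so the gate locks in this high baseline instead.
+- **recall@k floor net**: N=2000, Q=32, dim=768, k=10, clusters=16 → measured recall@10 = **0.996875** (dev arm64, to be confirmed on CI), FLOOR = **0.98** (X − 0.02 = 0.976875, rounded to two decimals). GT is computed in f64 (removes platform jitter), with total order `(score desc, index asc)`.
+- **Cosine fidelity net**: `max|cosine_i8 − cosine_f32_true|` = **0.00121**, gate = **0.005** (≈4× baseline). Fully deterministic and sensitive.
+- **CI**: `--features "vector_quant_i8,vector_faer" -- --test-threads=1` fail-closed + per-name guards for the 3 nets (N=<...> passed).
+
+## Feedback received (review / pre-verification)
+- Design review: N=2000/k=10, total-order tie-break, CI on the shipped tree (faer+quant).
+- **Caught by adversarial pre-verification**: recall@10 saturates at 768d (~0.997), so the "sensitive band" is unreachable → redesigned as a **recall floor + cosine fidelity backstop**; 1-ULP boundary jitter in an f32 GT → **f64 GT**; risk of a vacuous floor → `const _` compile-time guard + per-name CI guards.
+
+## Risks / rollback
+- Non-destructive (0 kernel lines) → no behavior change. Rollback: revert the PR.
+- Determinism: integer-exact i8 dot + f64 GT → platform-independent. Fidelity does not depend on ranking (no boundary jitter).
+- Vacuous gate: blocked at compile time by `const _: () = assert!(...)` guards, plus per-name fail-closed checks in CI.
+
+## Decision log
+- Confirmed that the shipped hot path is i8 (previous session) → moved the measurement/verification focus from f32 (the fallback) to i8.
+- The quality gates derive FLOOR/ε_q from the measured baseline (measure first). The corpus keeps the shipped configuration (768d, per-vector); it is not artificially made sensitive.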
+```
+
+- [ ] **Step 2: Add the PR6 row to `README.md`**
+
+In `docs/perf/vector-math-refactor/README.md`, add after the PR5 row line:
+
+```markdown
+| PR6 | i8 shipped hot path: **measurement + ε/recall/fidelity safety nets** | i8 verification gap | Low (non-destructive) | main(#67) | [LOC-64](https://linear.app/loceract/issue/LOC-64) | 🟦 In progress ([PR6.md](PR6.md)) |
+```
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/perf/vector-math-refactor/PR6.md docs/perf/vector-math-refactor/README.md
+git commit -m "docs(perf): PR6 journal entry + status row, i8 measure/parity results (LOC-64)"
+```
+
+---
+
+## Task 6: Full verification + open PR (stop at CI green)
+
+**Files:** none (verification + PR)
+
+- [ ] **Step 1: Full shipped-tree test run (all nets)**
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" -- --test-threads=1`
+Expected: `test result: ok.` with the ε + recall + fidelity tests among the passed set, 0 failed.
+
+- [ ] **Step 2: Confirm non-shipped trees still build/test (no regressions)**
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib -- --test-threads=1` (default features)
+Expected: `test result: ok.` (vector_quant tests are cfg-excluded here — fine; the i8 nets only run under the feature).
+Run: `cargo build --manifest-path rust_builder/rust/Cargo.toml --features "bench"` and `--features "bench,vector_faer,vector_quant_i8"`
+Expected: both build clean (stub + real i8 benches).
+
+- [ ] **Step 3: Push branch and open PR**
+
+```bash
+git push -u origin feat/loc-64-i8-measure-parity-net
+gh pr create --base main --head feat/loc-64-i8-measure-parity-net \
+  --title "PR6 — i8 hot-path measure + ε/recall/fidelity safety net (LOC-64)" \
+  --body "$(cat <<'BODY'
+Applies the PR1 "measure before changing" pattern to the SHIPPED int8 hot path (every prior review/bench scrutinized only the f32 fallback). Non-destructive — kernels unchanged.
+
+- **Measure**: i8 micro + i8 scan benches (vs f32-faer). Numbers in PR6.md.
+- **Numeric ε net**: i8 cosine kernel ≈ independent f64 reference, ε=1e-4.
+- **recall@k floor**: top-k(i8) vs top-k(f32, f64 ground truth) recall@10 ≥ measured baseline − margin. (Finding: i8@768 is ~lossless for recall@10.)
+- **cosine fidelity**: max|cosine_i8 − cosine_f32_true| ≤ measured bound — deterministic, sensitive backstop.
+- **CI fail-closed**: nets run on the shipped `vector_faer,vector_quant_i8` tree, with per-net name guards.
+
+Spec: docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md · Journal: PR6.md
+BODY
+)"
+```
+
+- [ ] **Step 4: Watch CI to green; hand off for user merge**
+
+Run: `gh pr checks <PR#> --watch` (re-poll on transient network error).
+Expected: all checks pass. **Do NOT merge** — report "PR opened, CI green" and let the user merge. After merge: PR6.md status → 🟩, README PR6 row → 🟩 (follow-up).
+
+---
+
+## Self-Review (filled by plan author)
+
+- **Spec coverage**: §3 bench → Task 3; §4 ε net → Task 1; §5 quality net → Task 2 (recall floor + fidelity, per the approved Net-2 redesign); §6 CI → Task 4; §7 tracking → Task 5 + issue created; §8 acceptance → Tasks 1–6; non-goal (0 kernel changes) → only `mod tests`, `bench_api`, `benches`, CI, docs touched. ✅ (Spec §5/§8/§9/§10 updated to the recall-floor + fidelity design.)
+- **Placeholders**: `MIN_RECALL`/`MAX_COS_ERR`/bench numbers are *measure-first outputs* with exact derivation + compile-time `const _` guards (Task 2 Steps 4–5), not vague TODOs. PR6.md `<...>` are explicitly "fill from measured results." ✅
+- **Type consistency**: `order_desc<T: PartialOrd>` used on both `(usize,f64)` (GT) and `(usize,f32)` (i8); `cosine_f64_true`/`ref_cosine_i8_f64`/`clustered_corpus`/`quantize_f32_to_i8`/`l2_norm_i8`/`i8_blob_from_slice`/`cosine_with_query_norm_i8_blob` match real `vector_quant.rs` signatures (verified). bench_api wrappers match. ✅
+- **Verification-driven fixes applied**: f64 GT (jitter), recall-floor not forced-band (saturation), const guards + CI name guards (vacuous gate), `pub(crate) mod` wording, flat `X−0.02` floor. ✅
diff --git a/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
index 7404db5..2dc97fe 100644
--- a/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
+++ b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
@@ -44,22 +44,28 @@
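 - **Rationale for ε**: the kernel's `dot_i8_i32`/`sq_sum` accumulate in i32 integers, so they are **exact** (at dim 1536 the max is ~2.5e7 ≪ i32 max 2.1e9, no overflow). The only floating-point error sources are the final `(sq_sum as f32).sqrt()` and the division by `query_norm` (f32). `1e-4` is a reasonable bound: it tolerates cross-platform error from that cast while still catching logic bugs (SIMD rewrites, indexing, norm errors).
 - Difference from the existing tests: those only check that the blob↔slice **entry points agree**. This one checks the *math itself* against an independent implementation → catches **bugs in future i8 kernel rewrites**.
 
-## 5. Component 3: recall@k net (quantization quality) ★core
+## 5. Component 3: quantization quality nets (recall@k floor + cosine fidelity) ★core
+
+> ⚠️ **Adversarial pre-verification finding (2026-05-31):** i8 per-vector quantization is so accurate at 768d that it almost never reorders the top 10: recall@10 ≈ 0.997. The "sensitive band (0.85~0.98)" is **unreachable** under the shipped configuration, and forcing it (low dimension, global scale) would no longer reflect the shipped path. Therefore (a) recall **locks in the measured high baseline** (no forced band), and (b) **cosine fidelity** is added as the truly sensitive, deterministic gate.
 
 - Location: the `vector_quant.rs` test module, `#[cfg(feature = "vector_quant_i8")]`.
-- **Synthetic clustered corpus** (deterministic, no rand dependency, `pseudo_vec`-style seeded generator):
-  - C cluster centers (unit vectors) + Gaussian-like noise → normalize. Mimics an embedding distribution where "a few are close and most are far".
-  - Defaults: **N=2000** candidates (reuses `SCAN_N`), **Q=32** queries, **dim=768**, **k=10** (recall@10, top 0.5%).
-- **Ground truth**: the **f32 cosine top-k** for each query (on the f32 corpus = the true ranking).
-- **i8 ranking**: quantize corpus/queries to i8, then take the **i8 cosine top-k**.
-- **Total-order comparator (required)**: sort **both** i8 and f32 by the total order `(score descending, index ascending)`. i8 scores come from i32 integers, so **ties are frequent** (a consequence of the kernel's accumulation); without an explicit index tie-break, differences in sort implementations across platforms (Ubuntu vs macOS) make the test **flaky**. Using the same comparator on both sides means only genuine quantization reordering affects recall.
-- **Metric**: `recall@k = |topk_i8 ∩ topk_f32| / k`, averaged over queries. Fixed seeds → **fully reproducible (non-statistical, non-flaky)**.
-- **Threshold = a product of measure-first**:
-  1. The first run of the implementation measures the actual `recall@10`.
-  2. **Saturation check**: if the measurement is ~1.0 the gate is decorative → raise cluster density/noise/overlap and recalibrate the corpus until recall falls in the **sensitive band (0.85~0.98)**.
-  3. Fix the CI gate at `recall@10 ≥ FLOOR` (FLOOR = measured value − margin, margin ≈ 0.03).
-  4. Record the measured value, FLOOR, and corpus parameters in `PR6.md`.
-- What we gain: CI goes red if quantization breaks nearest-neighbor ranking, which is exactly the safety net missing today. Measurement produces the threshold, and the threshold becomes the regression gate.
+- **Synthetic clustered corpus** (deterministic, no rand): C cluster centers + noise → normalize. **Keeps the shipped configuration** (N=2000, Q=32, dim=768, per-vector scale). The corpus is just a realistic distribution, not a sensitivity knob (clusters=16, fixed).
+
+### 5a. recall@k floor
+- **Ground truth (GT)**: the **f64 cosine top-k** for each query (original f32 values accumulated in f64 → the boundary gap ≫ x86/ARM 1-ULP jitter, so it is **platform-stable**). k=10.
+- **i8 ranking**: quantize to i8, then top-k from the i8 kernel.
+- **Total-order comparator**: `(score desc, index asc)` on both sides, a deterministic tie-break (i8 integer scores tie often → resolved by index).
+- **Gate**: `recall@10 ≥ FLOOR`, **FLOOR = 0.98** (baseline − 0.02 = 0.976875, rounded to two decimals; measured baseline 0.996875 = 319/320, dev arm64, deterministic). The forced band and saturation guard are dropped: a baseline of ~1.0 is both the reality and a good result.
+
+### 5b. Cosine fidelity (deterministic, sensitive backstop)
+- Measure `max|cosine_i8 − cosine_f32_true(f64)|` over all (query, candidate) pairs.
+- **Gate**: `max ≤ ε_q`, **ε_q ≈ 4 × measured max = 0.005** (measured max 0.00121). Independent of ranking → no boundary jitter, and the most sensitive signal for quantization quality loss.
+
+### Measure first → thresholds
+1. The first run (planning, dev arm64) measures recall@10 = X = 0.996875 and max fidelity error = M = 0.00121.
+2. **Fix in advance** `FLOOR = 0.98` (X − 0.02 = 0.976875, rounded to two decimals) and `ε_q ≈ 4·M = 0.005` (deterministic, so the same values hold on CI and every platform). `const _: () = assert!(...)` **blocks** vacuous thresholds (floor < 0.5 / ε_q ≥ 0.1) **at compile time**.
+3. Record X, M, FLOOR, and ε_q in `PR6.md`, and confirm the same values during implementation.
+- What we gain: CI goes red if a future i8/quantization change degrades search quality. Recall covers coarse regressions, fidelity covers fine ones.
 
 ## 6. Component 4: CI gating (fail-closed)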
 
@@ -80,31 +86,32 @@
 
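 - [ ] i8 microbench + i8 scan bench run, with numbers recorded in `PR6.md` (i8 throughput + i8 vs f32-faer ratio).
 - [ ] Numeric ε net: per-dim i8 kernel ≈ f64 reference within `<1e-4`, green.
-- [ ] recall@k net: deterministic, corpus sits in the sensitive band (0.85~0.98), `recall@10 ≥ FLOOR` green, FLOOR/measured value recorded.
-- [ ] CI runs the i8 tests fail-closed with `--features "vector_quant_i8,vector_faer" -- --test-threads=1`, no regressions in existing jobs.
+- [ ] recall@k floor: deterministic (f64 GT), `recall@10 ≥ FLOOR` (baseline − 0.02, rounded to 0.98) green, baseline/FLOOR recorded.
+- [ ] Cosine fidelity: `max|cosine_i8−cosine_f32_true| ≤ ε_q` green, measured max/ε_q recorded.
+- [ ] CI `--features "vector_quant_i8,vector_faer" -- --test-threads=1` fail-closed + per-name guards for the 3 nets, no regressions in existing jobs.
 - [ ] Confirmed 0 lines of kernel/quantization code changed (non-destructive).
 
 ## 9. Risks / mitigations
 
 | Risk | Mitigation |
 |---|---|
-| recall gate saturates (false reassurance) | §5-2 saturation check + corpus sensitivity calibration, then fix FLOOR |
-| Cross-platform flakiness from i8 ties | §5 total-order comparator `(score, index)` enforced on both sides |
+| recall baseline ~1.0 (insensitive at the coarse level) | Fine regressions are covered by the cosine fidelity backstop (ε_q) |
+| 1-ULP boundary jitter in an f32 GT | Compute GT in **f64** → boundary gap ≫ jitter, recall is platform-stable |
+| Vacuous gate (uncalibrated threshold) | `const _ assert` compile-time guard + per-name fail-closed CI checks for the 3 nets |
 | Compile conflicts between feature combinations | §6 CI tests on the shipped tree (faer+quant) |
-| README PR table conflict with #67 | §7 branch after #67 merges (avoid stacking) |
+| README PR table conflict with #67 | Resolved (#67 merged, branched from `main`) |
 | ε too tight/loose | Integer dot is exact, only sqrt/div introduce error → 1e-4 is mathematically reasonable (confirmed in review) |
 
 ## 10. Tunable defaults (stated in the spec, adjustable during implementation)
 
 | Parameter | Default | Notes |
 |---|---|---|
-| ε (numeric net) | `1e-4` | Integer dot exact, only sqrt/div error allowed |
+| ε (kernel ε net) | `1e-4` | Integer dot exact, sqrt/div error only |
 | k (recall) | `10` | recall@10 |
-| N (corpus) | `2000` | Reuses `SCAN_N`, top 0.5% |
-| Q (queries) | `32` | |
-| dim | `768` | Representative dimension of shipped embeddings |
-| recall margin | `measured − 0.03` | FLOOR fixed after first measurement |
-| Cluster count/noise | Calibrated by measurement | Targets the 0.85~0.98 sensitive recall band |
+| N / Q / dim | `2000 / 32 / 768` | Shipped configuration (per-vector scale); N reuses `SCAN_N` |
+| recall FLOOR | `0.98` (X − 0.02, rounded) | Measured X 0.996875; const guard `≥0.5` |
+| fidelity ε_q | `≈4·M = 0.005` | Measured max 0.00121; const guard `<0.1` |
+| Cluster count | `16` (fixed) | Realistic distribution; not a sensitivity knob |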
 
 ---
 

From d8bb71f101ade943de613b52a874ac1f4702f510 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 04:54:43 +0900
Subject: [PATCH 3/7] =?UTF-8?q?test(vector=5Fquant):=20i8=20cosine=20kerne?=
 =?UTF-8?q?l=20=CE=B5-parity=20vs=20independent=20f64=20reference=20(LOC-6?=
 =?UTF-8?q?4)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
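Note on why EPS only has to absorb float error: the spec argues that the
integer accumulation in the i8 kernel cannot overflow at the tested dims. A
minimal sketch of that bound follows. It assumes every quantized lane has
magnitude at most `max_abs` (127 if the quantizer clamps symmetrically, 128 for
the full i8 range; this patch does not show which one `quantize_f32_to_i8`
uses). `worst_case_acc` and `fits_in_i32` are hypothetical helper names, not
part of `vector_quant.rs`.

```rust
/// Worst-case |sum of lane products| for `dim` i8 lanes whose magnitude is at
/// most `max_abs`. Hypothetical helper, used only to illustrate the bound.
fn worst_case_acc(dim: usize, max_abs: i64) -> i64 {
    max_abs * max_abs * dim as i64
}

/// True when an i32 accumulator cannot overflow for this `dim` and `max_abs`.
fn fits_in_i32(dim: usize, max_abs: i64) -> bool {
    worst_case_acc(dim, max_abs) <= i32::MAX as i64
}
```

At dim 1536 either choice of `max_abs` lands near 2.5e7, about 85 times below
i32::MAX (≈2.1e9), which matches the figure quoted in the spec.
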
 rust_builder/rust/src/api/vector_quant.rs | 56 +++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/rust_builder/rust/src/api/vector_quant.rs b/rust_builder/rust/src/api/vector_quant.rs
index b1dfde3..69ce297 100644
--- a/rust_builder/rust/src/api/vector_quant.rs
+++ b/rust_builder/rust/src/api/vector_quant.rs
@@ -195,4 +195,60 @@ mod tests {
             assert_eq!(direct_blob, two_step_blob);
         }
     }
+
+    // --- PR6 shared test helpers (deterministic, no rand dep) ---
+
+    // Same generator as benches/vector_math.rs: reproducible run-to-run.
+    fn pseudo_vec(dim: usize, seed: u32) -> Vec<f32> {
+        (0..dim)
+            .map(|i| {
+                let x = (i as u32)
+                    .wrapping_mul(2_654_435_761)
+                    .wrapping_add(seed.wrapping_mul(40_503));
+                ((x % 1000) as f32 / 1000.0) - 0.5
+            })
+            .collect()
+    }
+
+    // Independent f64 reference cosine of two i8 vectors. Different accumulation
+    // width (i64) and float precision (f64) than the i32->f32 kernel, so a match
+    // proves the kernel math, not just that it agrees with itself.
+    fn ref_cosine_i8_f64(q: &[i8], t: &[i8]) -> f64 {
+        if q.len() != t.len() || q.is_empty() {
+            return 0.0;
+        }
+        let mut dot: i64 = 0;
+        let mut qsq: i64 = 0;
+        let mut tsq: i64 = 0;
+        for (&a, &b) in q.iter().zip(t.iter()) {
+            dot += (a as i64) * (b as i64);
+            qsq += (a as i64) * (a as i64);
+            tsq += (b as i64) * (b as i64);
+        }
+        if qsq == 0 || tsq == 0 {
+            return 0.0;
+        }
+        (dot as f64) / ((qsq as f64).sqrt() * (tsq as f64).sqrt())
+    }
+
+    #[test]
+    fn i8_blob_cosine_matches_independent_reference() {
+        // Integer dot/sq are exact; only the final f32 sqrt+div can drift.
+        const EPS: f64 = 1e-4;
+        for &dim in &[1usize, 2, 3, 16, 384, 768, 1024, 1536] {
+            let q = pseudo_vec(dim, 7);
+            let t = pseudo_vec(dim, 9);
+            let (qi, _) = quantize_f32_to_i8(&q);
+            let (ti, _) = quantize_f32_to_i8(&t);
+            let blob = i8_blob_from_slice(&ti);
+            let qn = l2_norm_i8(&qi);
+
+            let kernel = cosine_with_query_norm_i8_blob(&qi, qn, &blob) as f64;
+            let reference = ref_cosine_i8_f64(&qi, &ti);
+            assert!(
+                (kernel - reference).abs() < EPS,
+                "i8 cosine dim={dim}: kernel={kernel} ref={reference}"
+            );
+        }
+    }
 }

From 012d110cbb7997e46052652e55659c3a821a7972 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:00:19 +0900
Subject: [PATCH 4/7] test(vector_quant): i8 recall@10 floor + cosine-fidelity
 gates vs f64 truth (LOC-64)

---
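Note on the metric: the recall test below compares two top-k sets built with
the same `(score desc, index asc)` total order. A stripped-down sketch of that
computation on plain score slices, assuming `k > 0` and equal-length inputs;
`top_k` and `recall_at_k` are illustrative names, and the real test scores the
i8 side in f32 rather than f64.

```rust
use std::collections::HashSet;

/// Indices of the k highest scores under the total order
/// (score descending, index ascending), so ties resolve deterministically.
fn top_k(scores: &[f64], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]).then(a.cmp(&b)));
    idx.truncate(k);
    idx
}

/// recall@k = |top_k(truth) ∩ top_k(approx)| / k. Assumes k > 0.
fn recall_at_k(truth: &[f64], approx: &[f64], k: usize) -> f64 {
    let gt: HashSet<usize> = top_k(truth, k).into_iter().collect();
    let hits = top_k(approx, k)
        .into_iter()
        .filter(|i| gt.contains(i))
        .count();
    hits as f64 / k as f64
}
```

With Q=32 queries and k=10 there are 320 top-k slots in total, so the measured
baseline of 0.996875 corresponds to 319 matching slots.
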
 rust_builder/rust/src/api/vector_quant.rs | 163 ++++++++++++++++++++++
 1 file changed, 163 insertions(+)

diff --git a/rust_builder/rust/src/api/vector_quant.rs b/rust_builder/rust/src/api/vector_quant.rs
index 69ce297..f0e17f8 100644
--- a/rust_builder/rust/src/api/vector_quant.rs
+++ b/rust_builder/rust/src/api/vector_quant.rs
@@ -251,4 +251,167 @@ mod tests {
             );
         }
     }
+
+    // --- PR6 Task 2 helpers ---
+
+    fn normalize(v: &mut [f32]) {
+        let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
+        if n > 0.0 {
+            for x in v.iter_mut() {
+                *x /= n;
+            }
+        }
+    }
+
+    fn det_unit(dim: usize, seed: u32) -> Vec<f32> {
+        let mut v = pseudo_vec(dim, seed);
+        normalize(&mut v);
+        v
+    }
+
+    // Clustered corpus: vector i belongs to cluster (i % clusters); a weighted
+    // blend of that cluster's center and per-vector noise, normalized.
+    fn clustered_corpus(
+        n: usize,
+        dim: usize,
+        clusters: usize,
+        weight: f32,
+        seed0: u32,
+    ) -> Vec<Vec<f32>> {
+        let centers: Vec<Vec<f32>> =
+            (0..clusters).map(|c| det_unit(dim, 1_000 + c as u32)).collect();
+        (0..n)
+            .map(|i| {
+                let c = i % clusters;
+                let noise = pseudo_vec(dim, seed0 + i as u32);
+                let mut v: Vec<f32> = centers[c]
+                    .iter()
+                    .zip(noise.iter())
+                    .map(|(&ce, &no)| weight * ce + (1.0 - weight) * no)
+                    .collect();
+                normalize(&mut v);
+                v
+            })
+            .collect()
+    }
+
+    // Total order: score descending, then index ascending. total_cmp gives a
+    // provably total order (NaN-safe), so sort output is platform-deterministic.
+    fn order_desc_f64(a: &(usize, f64), b: &(usize, f64)) -> std::cmp::Ordering {
+        b.1.total_cmp(&a.1).then(a.0.cmp(&b.0))
+    }
+    fn order_desc_f32(a: &(usize, f32), b: &(usize, f32)) -> std::cmp::Ordering {
+        b.1.total_cmp(&a.1).then(a.0.cmp(&b.0))
+    }
+
+    // True cosine of the ORIGINAL f32 vectors, accumulated in f64 (boundary gap
+    // >> x86/ARM ULP jitter); also the reference for cosine fidelity.
+    fn cosine_f64_true(q: &[f32], t: &[f32]) -> f64 {
+        let mut dot = 0.0f64;
+        let mut qsq = 0.0f64;
+        let mut tsq = 0.0f64;
+        for (a, b) in q.iter().zip(t.iter()) {
+            let (a, b) = (*a as f64, *b as f64);
+            dot += a * b;
+            qsq += a * a;
+            tsq += b * b;
+        }
+        if qsq == 0.0 || tsq == 0.0 {
+            0.0
+        } else {
+            dot / (qsq.sqrt() * tsq.sqrt())
+        }
+    }
+
+    #[test]
+    fn i8_topk_recall_matches_f32_within_floor() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const K: usize = 10;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from measured baseline recall@10 = 0.996875 (deterministic:
+        // f64 GT + integer-exact i8 => bit-identical across x86/ARM). FLOOR =
+        // 0.9969 - 0.02 = 0.9769, rounded to 0.98; margin ~0.017 (~5 hits of 320).
+        const MIN_RECALL: f32 = 0.98;
+        const _: () = assert!(MIN_RECALL >= 0.9, "MIN_RECALL must be a real floor");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut recall_sum = 0.0f32;
+        for query in &queries {
+            let mut gt_scores: Vec<(usize, f64)> = corpus
+                .iter()
+                .enumerate()
+                .map(|(i, c)| (i, cosine_f64_true(query, c)))
+                .collect();
+            gt_scores.sort_by(order_desc_f64);
+            let gt: std::collections::HashSet<usize> =
+                gt_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            let mut i8_scores: Vec<(usize, f32)> = corpus_blob
+                .iter()
+                .enumerate()
+                .map(|(i, blob)| (i, cosine_with_query_norm_i8_blob(&qi, qn_i8, blob)))
+                .collect();
+            i8_scores.sort_by(order_desc_f32);
+            let got: std::collections::HashSet<usize> =
+                i8_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            recall_sum += gt.intersection(&got).count() as f32 / K as f32;
+        }
+        let recall = recall_sum / Q as f32;
+        println!("PR6 recall@{K} (N={N} Q={Q} dim={DIM} clusters={CLUSTERS}) = {recall}");
+        assert!(
+            recall >= MIN_RECALL,
+            "i8 recall@{K} regressed: {recall} < {MIN_RECALL}"
+        );
+    }
+
+    #[test]
+    fn i8_cosine_fidelity_vs_true_f32() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from measured max error 0.00121 (deterministic). 0.005 ~= 4x the
+        // baseline: sensitive to a lossier future quantizer yet never flaky.
+        const MAX_COS_ERR: f64 = 0.005;
+        const _: () = assert!(MAX_COS_ERR < 0.1, "MAX_COS_ERR must be a real bound");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut max_err = 0.0f64;
+        for query in &queries {
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            for (c, blob) in corpus.iter().zip(corpus_blob.iter()) {
+                let i8c = cosine_with_query_norm_i8_blob(&qi, qn_i8, blob) as f64;
+                let truec = cosine_f64_true(query, c);
+                let e = (i8c - truec).abs();
+                if e > max_err {
+                    max_err = e;
+                }
+            }
+        }
+        println!("PR6 max|cosine_i8 - cosine_f32_true| (N={N} Q={Q} dim={DIM}) = {max_err}");
+        assert!(
+            max_err <= MAX_COS_ERR,
+            "i8 cosine fidelity regressed: max err {max_err} > {MAX_COS_ERR}"
+        );
+    }
 }

From 77c025057f817e52fe3e94c118a0b6f2db6cae7e Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:10:27 +0900
Subject: [PATCH 5/7] bench(vector_math): add i8 hot-kernel + i8 scan benches
 (LOC-64)

---
 rust_builder/rust/benches/vector_math.rs | 58 +++++++++++++++++++++++-
 rust_builder/rust/src/bench_api.rs       | 27 +++++++++++
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/rust_builder/rust/benches/vector_math.rs b/rust_builder/rust/benches/vector_math.rs
index e3b5d70..effc21e 100644
--- a/rust_builder/rust/benches/vector_math.rs
+++ b/rust_builder/rust/benches/vector_math.rs
@@ -106,12 +106,68 @@ fn bench_scan(c: &mut Criterion) {
     g.finish();
 }
 
+#[cfg(feature = "vector_quant_i8")]
+fn bench_cosine_i8(c: &mut Criterion) {
+    let mut g = c.benchmark_group("cosine_i8");
+    for &dim in &DIMS {
+        let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 1));
+        let (ti, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 2));
+        let qn = bench_api::l2_norm_i8(&qi);
+        let tblob = bench_api::i8_blob_from_slice(&ti);
+        g.throughput(Throughput::Elements(dim as u64));
+        g.bench_with_input(BenchmarkId::from_parameter(dim), &dim, |b, _| {
+            b.iter(|| {
+                bench_api::cosine_with_query_norm_i8_blob(
+                    black_box(&qi),
+                    black_box(qn),
+                    black_box(&tblob),
+                )
+            })
+        });
+    }
+    g.finish();
+}
+#[cfg(not(feature = "vector_quant_i8"))]
+fn bench_cosine_i8(_c: &mut Criterion) {}
+
+// Shipped exact-scan inner loop: one query vs N candidate i8 blobs, scored with
+// zero f32 decode / zero per-row alloc — the actual release hot path.
+#[cfg(feature = "vector_quant_i8")]
+fn bench_scan_i8(c: &mut Criterion) {
+    let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 1));
+    let qn = bench_api::l2_norm_i8(&qi);
+    let blobs: Vec<Vec<u8>> = (0..SCAN_N)
+        .map(|i| {
+            let (vi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 100 + i as u32));
+            bench_api::i8_blob_from_slice(&vi)
+        })
+        .collect();
+
+    let mut g = c.benchmark_group("exact_scan_i8");
+    g.throughput(Throughput::Elements(SCAN_N as u64));
+    g.bench_function(BenchmarkId::new("i8_blob_cosine", SCAN_N), |b| {
+        b.iter(|| {
+            let mut best = f32::MIN;
+            for blob in &blobs {
+                let s = bench_api::cosine_with_query_norm_i8_blob(black_box(&qi), qn, black_box(blob));
+                if s > best {
+                    best = s;
+                }
+            }
+            black_box(best)
+        })
+    });
+    g.finish();
+}
+#[cfg(not(feature = "vector_quant_i8"))]
+fn bench_scan_i8(_c: &mut Criterion) {}
+
 criterion_group! {
     name = benches;
     config = Criterion::default()
         .sample_size(30)
         .warm_up_time(Duration::from_millis(500))
         .measurement_time(Duration::from_secs(2));
-    targets = bench_cosine, bench_dot, bench_decode, bench_scan
+    targets = bench_cosine, bench_dot, bench_decode, bench_scan, bench_cosine_i8, bench_scan_i8
 }
 criterion_main!(benches);
diff --git a/rust_builder/rust/src/bench_api.rs b/rust_builder/rust/src/bench_api.rs
index d59f4cf..33c4c2d 100644
--- a/rust_builder/rust/src/bench_api.rs
+++ b/rust_builder/rust/src/bench_api.rs
@@ -31,6 +31,33 @@ pub fn decode_f32_embedding(blob: &[u8]) -> Option<Vec<f32>> {
     vector_math::decode_f32_embedding(blob)
 }
 
+#[cfg(feature = "vector_quant_i8")]
+use crate::api::vector_quant;
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn quantize_f32_to_i8(input: &[f32]) -> (Vec<i8>, f32) {
+    vector_quant::quantize_f32_to_i8(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn l2_norm_i8(v: &[i8]) -> f32 {
+    vector_quant::l2_norm_i8(v)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn i8_blob_from_slice(input: &[i8]) -> Vec<u8> {
+    vector_quant::i8_blob_from_slice(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn cosine_with_query_norm_i8_blob(query: &[i8], query_norm: f32, target_blob: &[u8]) -> f32 {
+    vector_quant::cosine_with_query_norm_i8_blob(query, query_norm, target_blob)
+}
+
 /// Which backend this build compiled (for labelling bench output).
 pub const BACKEND: &str = if cfg!(feature = "vector_faer") {
     "faer"

From b0b71747f81bb02efb42f1962f9fbc58055af965 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:14:41 +0900
Subject: [PATCH 6/7] =?UTF-8?q?ci(vector=5Fquant):=20run=20i8=20=CE=B5/rec?=
 =?UTF-8?q?all/fidelity=20nets=20on=20shipped=20faer+quant=20tree=20(LOC-6?=
 =?UTF-8?q?4)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
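Note on the per-net guard: the shell loop below fails closed unless each named
test has a line containing `<name> ... ok` in the captured output. The same
check, sketched in Rust to make the logic explicit. This is only an
illustration of the matching rule; the module path used in the usage example
(`api::vector_quant::tests::`) is an assumption about how libtest prints these
tests, and `net_passed` / `missing_nets` are hypothetical names.

```rust
/// True when some line of `output` contains "<net> ... ok", mirroring
/// `grep -Eq "${net} \.\.\. ok"` in the CI script.
fn net_passed(output: &str, net: &str) -> bool {
    let needle = format!("{net} ... ok");
    output.lines().any(|line| line.contains(&needle))
}

/// Names from `required` that have no passing line. An empty result means the
/// guard is green; anything else should fail the job.
fn missing_nets<'a>(output: &str, required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|net| !net_passed(output, net))
        .collect()
}
```

A test that ran but printed `FAILED`, and a test that never ran because it was
renamed or cfg-excluded, both end up in the missing list, which is the
fail-closed behavior the guard is after.
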
 scripts/test_ci.sh | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/scripts/test_ci.sh b/scripts/test_ci.sh
index 8a06532..81bc048 100755
--- a/scripts/test_ci.sh
+++ b/scripts/test_ci.sh
@@ -48,6 +48,30 @@ case "$TARGET" in
       echo "[ci] ERROR: faer vector_math matched 0 tests (renamed/cfg-excluded?); failing closed" >&2
       exit 1
     fi
+    echo "[ci] Running i8 quant kernels + ε/recall/fidelity safety nets on the SHIPPED faer+quant tree"
+    # The shipped per-candidate hot path is i8 (cosine_with_query_norm_i8_blob),
+    # not the f32 faer kernels. Run the vector_quant tests on the exact shipped
+    # feature combo and fail closed on zero matches.
+    if ! quant_out="$(cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1 2>&1)"; then
+      echo "$quant_out"
+      echo "[ci] ERROR: i8 vector_quant tests failed" >&2
+      exit 1
+    fi
+    echo "$quant_out"
+    if ! grep -Eq 'test result: ok\. [1-9][0-9]* passed' <<<"$quant_out"; then
+      echo "[ci] ERROR: i8 vector_quant matched 0 tests (renamed/cfg-excluded?); failing closed" >&2
+      exit 1
+    fi
+    # Fail closed if any specific safety net was renamed/cfg-excluded (a broad
+    # filter + N>=1 alone would stay green on the legacy i8 tests).
+    for net in i8_blob_cosine_matches_independent_reference \
+               i8_topk_recall_matches_f32_within_floor \
+               i8_cosine_fidelity_vs_true_f32; do
+      if ! grep -Eq "${net} \.\.\. ok" <<<"$quant_out"; then
+        echo "[ci] ERROR: i8 safety net '${net}' did not run/pass (renamed/cfg-excluded?); failing closed" >&2
+        exit 1
+      fi
+    done
     # Compile-check the actual shipped feature combo (faer + i8 quant). A
     # default-feature release build would never cover the backend that ships.
     cargo build --manifest-path rust_builder/rust/Cargo.toml --release --features vector_faer,vector_quant_i8

From 1cf33b13448c397231d0ff6dc4b82a2ad37b048c Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:19:45 +0900
Subject: [PATCH 7/7] docs(perf): PR6 journal entry + status row, i8
 measure/parity results (LOC-64)

---
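Note on the headline numbers: the ratios recorded in PR6.md follow from the
raw timings by simple arithmetic. A small sketch that recomputes them, using
only the figures stated in the journal entry; the helper names are
illustrative.

```rust
/// Ratio of a baseline time to a candidate time (same unit for both).
fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Average cost per scanned candidate in ns, given a whole-scan time in µs.
fn per_candidate_ns(scan_us: f64, n: usize) -> f64 {
    scan_us * 1_000.0 / n as f64
}

/// How many top-k slots separate the measured recall from the gate.
fn slack_hits(baseline: f64, floor: f64, slots: usize) -> f64 {
    (baseline - floor) * slots as f64
}
```

`speedup(452.82, 29.98)` gives about 15.1, and `per_candidate_ns(29.98, 2000)`
gives about 14.99 ns, which agrees with the 14.97 ns measured by the 768-dim
microbench. `slack_hits(0.996875, 0.98, 320)` is about 5.4, so the recall gate
tolerates roughly five more missed slots out of 320 before it fails.
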
 docs/perf/vector-math-refactor/PR6.md    | 31 ++++++++++++++++++++++++
 docs/perf/vector-math-refactor/README.md |  1 +
 2 files changed, 32 insertions(+)
 create mode 100644 docs/perf/vector-math-refactor/PR6.md

diff --git a/docs/perf/vector-math-refactor/PR6.md b/docs/perf/vector-math-refactor/PR6.md
new file mode 100644
index 0000000..2a3588e
--- /dev/null
+++ b/docs/perf/vector-math-refactor/PR6.md
@@ -0,0 +1,31 @@
+# PR6: i8 shipped hot-path measurement + ε/recall/fidelity safety nets (N: measure first)
+
+- Branch: `feat/loc-64-i8-measure-parity-net`
+- Linear: [LOC-64](https://linear.app/loceract/issue/LOC-64)
+- Status: 🟦 In progress (PR open, waiting for CI green)
+- Design: [PR6-spec-i8-measure-parity-net.md](PR6-spec-i8-measure-parity-net.md) · Plan: [PR6-plan-i8-measure-parity-net.md](PR6-plan-i8-measure-parity-net.md)
+
+## Scope (non-destructive: 0 lines of kernel/quantization code changed)
+Applies the PR1 pattern to the shipped hot path (i8 `cosine_with_query_norm_i8_blob`): measurement + numeric ε net + recall@k floor + cosine fidelity net + fail-closed CI.
+
+## Results (measured, dev arm64)
+- **i8 hot-kernel microbench** (ns): 384=7.87 / 768=14.97 / 1024=21.12 / 1536=31.28
+- **Scan (2000×768)**: `exact_scan[faer]` (f32 decode + cosine) **452.82 µs** vs `exact_scan_i8` (i8 blob) **29.98 µs** → i8 is **≈15.1× faster** than f32-faer.
+- **Key finding**: the shipped i8 hot path is ~15× faster than the f32 fallback while keeping **recall@10 ≈ 0.997** (= 319/320, nearly lossless): fast and accurate.
+- **Numeric ε net**: kernel ≈ independent f64 reference at dims {1,2,3,16,384,768,1024,1536}, ε=1e-4, green.
+- **recall@k floor net**: N=2000, Q=32, dim=768, k=10, clusters=16 → recall@10 = **0.996875**, FLOOR = **0.98**. GT is computed in f64 (removes platform jitter); the total order `(score desc, index asc)` uses `total_cmp` (NaN-safe).
+- **Cosine fidelity net**: `max|cosine_i8 − cosine_f32_true|` = **0.00121**, gate **ε_q = 0.005** (≈4× baseline). Independent of ranking and fully deterministic.
+- **CI**: `--features "vector_quant_i8,vector_faer" -- --test-threads=1` fail-closed + per-name guards for the 3 nets, 7 passed.
+
+## Feedback received (review / pre-verification)
+- Caught by adversarial pre-verification: recall@10 saturates at 768d (~0.997), so the "sensitive band" is unreachable → redesigned as a **recall floor + cosine fidelity backstop**; 1-ULP boundary jitter in an f32 GT → **f64 GT**; risk of a vacuous gate → `const _` compile-time guard + per-name CI guards.
+- Implementation review: replaced an `order_desc` based on `partial_cmp().unwrap_or(Equal)` (non-transitive with NaN) with concrete `total_cmp`-based helpers; tightened the per-net CI regex to `\.\.\. ok`.
+
+## Risks / rollback
+- Non-destructive (0 kernel lines) → no behavior change. Rollback: revert the PR.
+- Determinism: integer-exact i8 dot + f64 GT → platform-independent (measured values are bit-identical). Fidelity does not depend on ranking (no boundary jitter).
+- Vacuous gate: `const _: () = assert!(...)` compile-time guards + per-name fail-closed checks in CI.
+
+## Decision log
+- Confirmed that the shipped hot path is i8 (previous session) → moved the measurement/verification focus from f32 (the fallback) to i8.
+- The quality gates derive FLOOR (0.98) and ε_q (0.005) from the measured baseline (measure first). The corpus keeps the shipped configuration (768d, per-vector); it is not artificially made sensitive.
diff --git a/docs/perf/vector-math-refactor/README.md b/docs/perf/vector-math-refactor/README.md
index 8c1fd94..679a201 100644
--- a/docs/perf/vector-math-refactor/README.md
+++ b/docs/perf/vector-math-refactor/README.md
@@ -29,6 +29,7 @@
 | PR3 | decode buffer reuse | Claim1 | Low to medium | bench/N3 gate | [LOC-61](https://linear.app/loceract/issue/LOC-61) | ❌ Dropped (f32 decode is not hot in the shipped i8 build, verified in code) |
 | PR4 | ~~multi-accumulator unroll~~ | — | — | — | [LOC-62](https://linear.app/loceract/issue/LOC-62) | ❌ Dropped (moot since faer is kept) |
 | PR5 | Hygiene: corruption logging (N6) + endianness documentation (N5) | N6, N5 | Low (independent) | — | [LOC-63](https://linear.app/loceract/issue/LOC-63) | 🟩 Merged (#66, [PR5.md](PR5.md)) |
+| PR6 | i8 shipped hot path: **measurement + ε/recall/fidelity safety nets** | i8 verification gap | Low (non-destructive) | main(#67) | [LOC-64](https://linear.app/loceract/issue/LOC-64) | 🟦 In progress ([PR6.md](PR6.md)) |
 
 Closing retrospective: [RETRO.md](RETRO.md) · PR3 ([LOC-61](https://linear.app/loceract/issue/LOC-61)) ❌ confirmed dropped (RETRO §5) · Remaining (optional): on-device bench / encode helper dedup; see the project handoff notes.