From 9e4cde3be4b64e09294be6972ce577039da16a87 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 04:11:01 +0900
Subject: [PATCH 1/7] =?UTF-8?q?docs(perf):=20PR6=20design=20spec=20?=
 =?UTF-8?q?=E2=80=94=20i8=20hot-path=20measure=20+=20=CE=B5/recall=20safet?=
 =?UTF-8?q?y=20net=20(LOC-64)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Brainstorming output for the next work-stream: apply the PR1 "measure
before changing" pattern to the SHIPPED hot path (i8), which every prior
review/bench scrutinized only on the f32 fallback. Non-destructive:
kernels unchanged. Captures i8 bench baseline + numeric ε parity + a
recall@k quantization-quality gate, all fail-closed in CI on the shipped
faer+quant compile tree. Implementation follows via writing-plans.
---
 .../PR6-spec-i8-measure-parity-net.md | 111 ++++++++++++++++++
 1 file changed, 111 insertions(+)
 create mode 100644 docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md

diff --git a/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
new file mode 100644
index 0000000..7404db5
--- /dev/null
+++ b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
@@ -0,0 +1,111 @@
+# PR6 design spec: measuring the shipped i8 hot path + ε/recall safety net [measure first]
+
+- Written: 2026-05-31
+- Status: 📝 Design (brainstorming output); after approval, the implementation plan is written with writing-plans
+- Linear: [LOC-64](https://linear.app/loceract/issue/LOC-64)
+- Branch: `feat/loc-64-i8-measure-parity-net`, base = `main` (PR #67 merged, `1217123`). Stack trap avoided.
+- Approach: **A: "PR1 replay, extended to i8"**
+
+## 1. Background / Why (Problem)
+
+What this session's code verification revealed: **the per-candidate hot path of the shipped build (`vector_faer,vector_quant_i8`) is the i8 path** (`cosine_with_query_norm_i8_blob`), yet:
+
+- Every review and benchmark so far (PR1 included), as well as the faer/fused debate, looked at the **f32 path**, which is only the **fallback** in the shipped build.
+- The i8 hot kernel that actually ships has **zero microbenchmarks** (`benches/vector_math.rs` covers only f32 dot/l2/cosine/decode).
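+- There is **no test that verifies the retrieval quality (ranking/recall)** of i8 quantization. The existing i8 tests ([vector_quant.rs:129-200](../../../rust_builder/rust/src/api/vector_quant.rs)) cover only (a) quantize↔dequantize round-trip error `<0.05`, (b) a coarse direction sanity check (`>0.9`/`<-0.9`), and (c) blob↔slice entry-point agreement (`<1e-6`). **Whether quantization flips the top-k ranking of nearest neighbors is unverified.**
+
+So the PR1 principle, "freeze the current state before changing any kernel", is applied this time to the **shipped hot path (i8)**. Once this PR is merged, CI mathematically blocks any future i8 change that breaks retrieval quality.
+
+## 2. Non-goals
+
+- **0 lines of kernel/quantization code changed.** This PR is measurement + safety net only. i8 optimization goes into a separate PR on top of this net.
+- Not a re-measurement or redesign of the f32 path (done in PR1; keeping faer is settled).
+- Not an on-device bench (separate optional work).
+
+## 3. Component 1: measurement (bench)
+
+- Expose the i8 surface in `src/bench_api.rs` (`#[cfg(feature = "bench")]`, same pattern as the existing f32 exposure): `quantize_f32_to_i8`, `l2_norm_i8`, `cosine_with_query_norm_i8_blob`, `i8_blob_from_slice`. (The target functions are already `pub`, so no visibility change is needed, only re-exports.)
+- Added targets in `benches/vector_math.rs`:
+  - `bench_cosine_i8[dim]`: i8 cosine microbench per `DIMS` (384/768/1024/1536).
+  - `bench_scan_i8`: one query vs `SCAN_N` (2000) candidates, i8-blob scan (mimics the shipped hot loop), side by side with the existing f32 `exact_scan`.
+- Run:
+  - `cargo bench --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_quant_i8"` (i8 hot kernel)
+  - Existing f32: `--features "bench,vector_faer"` (comparison baseline)
+  - Distinguished by the `bench_api::BACKEND` label.
+- **Journal record**: freeze the i8 hot-kernel throughput (per dimension) and the i8 vs f32-faer scan ratio in `PR6.md`.
+- What we gain: closes the "zero numbers for the shipped hot kernel" gap and quantifies how much i8 actually gains over f32.
+
+## 4. Component 2: numeric ε net (kernel correctness)
+
+- Location: `vector_quant.rs` test module (next to the existing i8 tests), `#[cfg(feature = "vector_quant_i8")]`.
+- Model: PR1's [`faer_parity_tests`](../../../rust_builder/rust/src/api/vector_math.rs#L208) (kernel ≈ independent reference, agreement within ε).
+- Assertion: `cosine_with_query_norm_i8_blob` (kernel) ≈ an **independent reference reimplementation**, agreeing within `ε = 1e-4` for each dimension.
+  - Independent reference: for the same i8 inputs, accumulate dot and sq_sum in **f64**, then `sqrt`/divide → a different accumulation width and implementation than the kernel.
+- **Rationale for ε**: the kernel's `dot_i8_i32`/`sq_sum` accumulate in i32 integers and are therefore **exact** (at dim 1536 the max is ~2.5e7 ≪ i32 max 2.1e9, so no overflow). The only source of floating-point error is the final `(sq_sum as f32).sqrt()` plus the division by `query_norm` (f32). `1e-4` is a reasonable bound that tolerates the cross-platform error of this cast while catching logic bugs (SIMD rewrites, indexing, norm errors).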
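+- Difference from the existing tests: those only check blob↔slice **entry-point agreement**. This one checks *the math itself* against an independent implementation → it catches **bugs from future i8 kernel rewrites**.
+
+## 5. Component 3: recall@k net (quantization quality) ★core
+
+- Location: `vector_quant.rs` test module, `#[cfg(feature = "vector_quant_i8")]`.
+- **Synthetic clustered corpus** (deterministic, no rand dependency, `pseudo_vec`-style seeded generator):
+  - C cluster centers (unit vectors) + Gaussian-like noise → normalize. Mimics an embedding distribution where "a few are near, most are far".
+  - Defaults: **N=2000** candidates (= reuses `SCAN_N`), **Q=32** queries, **dim=768**, **k=10** (recall@10, top 0.5%).
+- **Ground truth**: each query's **f32 cosine top-k** (on the f32 corpus = the true ranking).
+- **i8 ranking**: the **i8 cosine top-k** after quantizing the corpus and queries to i8.
+- **Total-order comparator (required)**: sort **both** i8 and f32 by the total order `(score descending, index ascending)`. Because i8 scores come from i32 integer sums, **ties occur in bulk** (a consequence of the kernel's accumulation structure); unless the index tie-break is explicitly enforced, differences between platform sort implementations (Ubuntu vs macOS) make the test **flaky**. Applying the same comparator on both sides means only genuine quantization reordering shows up in recall.
+- **Metric**: `recall@k = |topk_i8 ∩ topk_f32| / k`, averaged over queries. Fixed seeds → **fully reproducible (non-statistical, non-flaky)**.
+- **Threshold = a product of measuring first**:
+  1. The first run of the implementation measures the actual `recall@10`.
+  2. **Saturation check**: if the measurement is ~1.0 the gate is decorative → raise cluster density/noise/overlap and recalibrate the corpus until recall lands in the **sensitive band (0.85~0.98)**.
+  3. Fix the CI gate at `recall@10 ≥ FLOOR` (FLOOR = measured value − margin, margin ≈ 0.03).
+  4. Record the measured value, FLOOR, and corpus parameters in `PR6.md`.
+- What we gain: if quantization breaks nearest-neighbor ranking, CI goes red. That is the safety net missing today. The measurement produces the threshold, and the threshold becomes the regression gate.
+
+## 6. Component 4: CI gating (fail-closed)
+
+- Add to `scripts/test_ci.sh`: `cargo test --lib --features "vector_quant_i8,vector_faer" -- --test-threads=1`
+  - **Matches the shipped compile tree (faer+quant) 100%**, so macro/compile conflicts between features are also caught up front in CI.
+  - Like PR2's faer step, **requires ≥1 passing test (fail-closed)**: zero passing tests (nothing collected) counts as a failure.
+  - Keep `--test-threads=1` (the [[project_rust_tests_need_single_thread]] convention).
+- PR2 already **builds** the shipped build with `vector_faer,vector_quant_i8` → adding an **i8 test run** on top closes the N2-style blind spot at the source.
+
+## 7. Tracking
+
+- New Linear issue: **"PR6: measuring the shipped i8 hot path + ε/recall safety net [measure first]"** (under the project, priority High, since it bears directly on shipped retrieval quality).
+- New `docs/perf/vector-math-refactor/PR6.md` (journal template: results Before→After, feedback, risks/rollback, decision log).
+- Add a PR6 row to the README PR status table + link it to the i8 verification item under RETRO §5 "Next work".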
+- ✅ **Merge order**: #67 (closeout) is merged (`1217123`) → the branch is cut from current `main`, so there is no README PR status table conflict and no stack trap.
+
+## 8. Acceptance criteria
+
+- [ ] i8 microbench + i8 scan bench run, with numbers recorded in `PR6.md` (i8 throughput + i8 vs f32-faer ratio).
+- [ ] Numeric ε net: per dimension, i8 kernel ≈ f64 reference within `<1e-4`, green.
+- [ ] recall@k net: deterministic, corpus sits in the sensitive band (0.85~0.98), `recall@10 ≥ FLOOR` green, FLOOR and measured value recorded.
+- [ ] CI runs the i8 tests fail-closed with `--features "vector_quant_i8,vector_faer" -- --test-threads=1`, with no regression in existing jobs.
+- [ ] Confirmed: 0 lines of kernel/quantization code changed (non-destructive).
+
+## 9. Risks / mitigations
+
+| Risk | Mitigation |
+|---|---|
+| recall gate saturates (false reassurance) | §5-2 saturation check + corpus sensitivity calibration before fixing FLOOR |
+| Cross-platform flakiness from i8 ties | §5 total-order comparator `(score, index)` enforced on both sides |
+| Compile conflict between feature combinations | §6 CI tests on the shipped tree (faer+quant) |
+| README PR table conflict with #67 | §7 branch after #67 is merged (stack avoided) |
+| ε too tight/loose | integer dot = exact, only sqrt/div contribute error → 1e-4 is mathematically reasonable (confirmed in review) |
+
+## 10. Tunable defaults (stated in the spec, adjustable during implementation)
+
+| Parameter | Default | Notes |
+|---|---|---|
+| ε (numeric net) | `1e-4` | integer dot is exact, only sqrt/div error is tolerated |
+| k (recall) | `10` | recall@10 |
+| N (corpus) | `2000` | reuses `SCAN_N`, top 0.5% |
+| Q (queries) | `32` | |
+| dim | `768` | representative dimension of shipped embeddings |
+| recall margin | `measured − 0.03` | FLOOR fixed after the first measurement |
+| Cluster count/noise | calibrated by measurement | targets the recall 0.85~0.98 sensitive band |
+
+---
+
+Implementation phase: after approval, write a step-by-step implementation plan with the **writing-plans** skill → execute on the `feat/loc-64-i8-measure-parity-net` branch.
From 00ecf21e17b758c8896b9f7ac5410773bbfddc36 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 04:45:48 +0900
Subject: [PATCH 2/7] docs(perf): PR6 implementation plan + spec (recall-floor + cosine-fidelity, measured baselines) (LOC-64)

Plan and spec for the i8 measure-first + safety-net work, finalized after an adversarial pre-flight that (a) compiled the test/bench code on the shipped vector_quant_i8,vector_faer tree and (b) measured real baselines: recall@10 = 0.996875 (i8@768 is ~lossless), max|cosine_i8 - cosine_f32_true| = 0.00121.
Net 2 redesigned to recall floor (>= 0.98) + deterministic cosine-fidelity backstop (<= 0.005), f64 ground truth (kills x86/ARM ULP jitter), const-assert + CI name guards against vacuous gates. --- .../PR6-plan-i8-measure-parity-net.md | 654 ++++++++++++++++++ .../PR6-spec-i8-measure-parity-net.md | 57 +- 2 files changed, 686 insertions(+), 25 deletions(-) create mode 100644 docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md diff --git a/docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md b/docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md new file mode 100644 index 0000000..e7adc22 --- /dev/null +++ b/docs/perf/vector-math-refactor/PR6-plan-i8-measure-parity-net.md @@ -0,0 +1,654 @@ +# PR6 — i8 출시 핫패스 측정 + ε/recall 안전망 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Lock the shipped int8 retrieval hot path with a benchmark baseline + a numeric ε kernel-parity net + two quantization-quality gates (recall@k floor + cosine fidelity), all fail-closed in CI — without changing any kernel. + +**Architecture:** Non-destructive "measure before changing" PR (PR1 pattern, applied to the i8 path). Add tests to the already-`vector_quant_i8`-gated `vector_quant.rs` test module, add i8 benches via the `bench` re-export surface, and wire a fail-closed i8 test step into `scripts/test_ci.sh` on the shipped `vector_faer,vector_quant_i8` compile tree. + +**Tech Stack:** Rust, criterion (dev-dep, `bench` feature), cargo feature flags (`vector_faer`, `vector_quant_i8`, `bench`), bash CI script. + +**Spec:** [PR6-spec-i8-measure-parity-net.md](PR6-spec-i8-measure-parity-net.md) · **Linear:** [LOC-64](https://linear.app/loceract/issue/LOC-64) · **Branch:** `feat/loc-64-i8-measure-parity-net` (already created off `main` @ `1217123`). 
+ +**Verification note (why Task 2 looks the way it does):** an adversarial pre-flight (running the real kernel) found i8 per-vector quantization at dim 768 is too accurate to reorder a top-10 — recall@10 ≈ 0.997 and cannot be pushed into a "sensitive band" without abandoning the shipped settings. So the quality gate **locks that high baseline** (`recall ≥ baseline − margin`) and adds a genuinely sensitive, fully-deterministic **cosine-fidelity** backstop. Ground-truth cosine is computed in **f64** so the recall boundary can't flip on x86-vs-ARM ULP jitter. + +**Conventions (this repo):** +- Rust tests run with `-- --test-threads=1` (shared-SQLite parallelism; convention). +- Commits authored solely by the user — **NO** `Co-Authored-By` / Claude footer. +- Open PR, stop at CI green; **user merges**. +- All `cargo` commands use `--manifest-path rust_builder/rust/Cargo.toml`. + +--- + +## File Structure + +| File | Change | Responsibility | +|---|---|---| +| `rust_builder/rust/src/api/vector_quant.rs` | Modify (`mod tests` only) | ε kernel-parity net + recall@k floor + cosine-fidelity net + shared deterministic test helpers. **No non-test code touched.** | +| `rust_builder/rust/src/bench_api.rs` | Modify | `#[cfg(feature="vector_quant_i8")]` i8 re-export wrappers for the bench crate. | +| `rust_builder/rust/benches/vector_math.rs` | Modify | `bench_cosine_i8` + `bench_scan_i8` (cfg-stubbed when feature off) + targets list. | +| `scripts/test_ci.sh` | Modify (`native` case) | Fail-closed i8 test run on the shipped `vector_faer,vector_quant_i8` tree, with per-net name guards. | +| `docs/perf/vector-math-refactor/PR6.md` | Create | Journal entry: bench numbers, recall baseline/FLOOR, fidelity bound, decisions. | +| `docs/perf/vector-math-refactor/README.md` | Modify | Add PR6 row to the status table. 
|
+
+---
+
+## Task 1: Numeric ε net (i8 kernel correctness)
+
+Mirror of [`faer_parity_tests`](../../../rust_builder/rust/src/api/vector_math.rs#L208): assert the shipped i8 cosine kernel agrees with an **independent f64 reference of the same i8 inputs** within a tight ε. The i8 dot and squared-norms are exact integer sums, so the only divergence is the final `sqrt` + division → `1e-4` catches logic/SIMD-rewrite bugs while tolerating the f32 cast.
+
+**Files:**
+- Modify: `rust_builder/rust/src/api/vector_quant.rs` (inside existing `#[cfg(test)] mod tests`, after line 197 / before the closing `}` at line 198)
+
+- [ ] **Step 1: Add shared deterministic test helpers + the ε test**
+
+Insert into `mod tests` (before its closing brace):
+
+```rust
+    // --- PR6 shared test helpers (deterministic, no rand dep) ---
+
+    // Same generator as benches/vector_math.rs: reproducible run-to-run.
+    fn pseudo_vec(dim: usize, seed: u32) -> Vec<f32> {
+        (0..dim)
+            .map(|i| {
+                let x = (i as u32)
+                    .wrapping_mul(2_654_435_761)
+                    .wrapping_add(seed.wrapping_mul(40_503));
+                ((x % 1000) as f32 / 1000.0) - 0.5
+            })
+            .collect()
+    }
+
+    // Independent f64 reference cosine of two i8 vectors. Different accumulation
+    // width (i64) and float precision (f64) than the i32->f32 kernel, so a match
+    // proves the kernel math, not just that it agrees with itself.
+    fn ref_cosine_i8_f64(q: &[i8], t: &[i8]) -> f64 {
+        if q.len() != t.len() || q.is_empty() {
+            return 0.0;
+        }
+        let mut dot: i64 = 0;
+        let mut qsq: i64 = 0;
+        let mut tsq: i64 = 0;
+        for (&a, &b) in q.iter().zip(t.iter()) {
+            dot += (a as i64) * (b as i64);
+            qsq += (a as i64) * (a as i64);
+            tsq += (b as i64) * (b as i64);
+        }
+        if qsq == 0 || tsq == 0 {
+            return 0.0;
+        }
+        (dot as f64) / ((qsq as f64).sqrt() * (tsq as f64).sqrt())
+    }
+
+    #[test]
+    fn i8_blob_cosine_matches_independent_reference() {
+        // Integer dot/sq are exact; only the final f32 sqrt+div can drift.
+ const EPS: f64 = 1e-4; + for &dim in &[1usize, 2, 3, 16, 384, 768, 1024, 1536] { + let q = pseudo_vec(dim, 7); + let t = pseudo_vec(dim, 9); + let (qi, _) = quantize_f32_to_i8(&q); + let (ti, _) = quantize_f32_to_i8(&t); + let blob = i8_blob_from_slice(&ti); + let qn = l2_norm_i8(&qi); + + let kernel = cosine_with_query_norm_i8_blob(&qi, qn, &blob) as f64; + let reference = ref_cosine_i8_f64(&qi, &ti); + assert!( + (kernel - reference).abs() < EPS, + "i8 cosine dim={dim}: kernel={kernel} ref={reference}" + ); + } + } +``` + +- [ ] **Step 2: Run the test — expect PASS (net green on current kernel)** + +Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features vector_quant_i8 i8_blob_cosine_matches_independent_reference -- --test-threads=1` +Expected: `test result: ok. 1 passed` + +- [ ] **Step 3: Prove the net has teeth (temporary mutation → red → revert)** + +Temporarily change `EPS` to `1e-12` and re-run Step 2. +Expected: FAIL (`kernel=... ref=...`) — confirms the assertion is live, not vacuous. +Then **revert `EPS` back to `1e-4`** and re-run Step 2 → PASS. + +- [ ] **Step 4: Commit** + +```bash +git add rust_builder/rust/src/api/vector_quant.rs +git commit -m "test(vector_quant): i8 cosine kernel ε-parity vs independent f64 reference (LOC-64)" +``` + +--- + +## Task 2: Quantization-quality nets — recall@k floor + cosine fidelity (measure-first) + +Two complementary nets that reflect the **shipped path** (768-dim, per-vector scale): +1. **recall@k floor** — top-k(i8) vs top-k(f32-true) overlap, gated `≥ measured baseline − margin`. (Baseline ~0.997; we lock the real high quality, we do NOT force an artificial band.) +2. **cosine fidelity** — `max|cosine_i8 − cosine_f32_true| ≤ ε_q`. Fully deterministic (no ranking/boundary), the genuinely sensitive gate against a lossier future quantizer. + +Ground-truth cosine is f64 (kills x86/ARM boundary jitter). `const _` guards prevent shipping a vacuous threshold if calibration is skipped. 
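The two quality metrics described above can be sketched in isolation, independent of the shipped kernel. This is a minimal illustration, not the plan's test code: `top_k`, `recall_at_k`, and `max_abs_err` are hypothetical helper names, and plain `f64` score slices stand in for the real i8 and f64 cosine outputs. It shows why the `(score descending, index ascending)` total order makes the top-k set unambiguous when scores tie.

```rust
use std::cmp::Ordering;
use std::collections::HashSet;

/// Hypothetical helper: indices of the k best scores under the total order
/// (score descending, then index ascending), so tied scores rank deterministically.
fn top_k(scores: &[f64], k: usize) -> Vec<usize> {
    let mut ranked: Vec<(usize, f64)> = scores.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| {
        b.1.partial_cmp(&a.1)
            .unwrap_or(Ordering::Equal)
            .then(a.0.cmp(&b.0))
    });
    ranked.into_iter().take(k).map(|(i, _)| i).collect()
}

/// Hypothetical helper: fraction of the ground-truth top-k that the
/// approximate ranking also returned (recall@k for one query).
fn recall_at_k(truth: &[usize], got: &[usize]) -> f64 {
    if truth.is_empty() {
        return 0.0;
    }
    let truth_set: HashSet<usize> = truth.iter().copied().collect();
    let hits = got.iter().filter(|i| truth_set.contains(i)).count();
    hits as f64 / truth.len() as f64
}

/// Hypothetical helper: largest absolute difference between approximate and
/// exact scores (the fidelity metric; it involves no ranking at all).
fn max_abs_err(approx: &[f64], exact: &[f64]) -> f64 {
    approx
        .iter()
        .zip(exact)
        .map(|(a, e)| (a - e).abs())
        .fold(0.0, f64::max)
}
```

In this sketch, two candidates tied at the best score always rank by index, so the top-2 set is the same on every run. The fidelity helper never sorts anything, which is the property the plan relies on when it calls that gate free of boundary jitter.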
+
+**Files:**
+- Modify: `rust_builder/rust/src/api/vector_quant.rs` (same `mod tests`, append after Task 1's helpers)
+
+- [ ] **Step 1: Add corpus generator + generic comparator + f64 reference**
+
+Append into `mod tests`:
+
+```rust
+    fn normalize(v: &mut [f32]) {
+        let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
+        if n > 0.0 {
+            for x in v.iter_mut() {
+                *x /= n;
+            }
+        }
+    }
+
+    fn det_unit(dim: usize, seed: u32) -> Vec<f32> {
+        let mut v = pseudo_vec(dim, seed);
+        normalize(&mut v);
+        v
+    }
+
+    // Clustered corpus: vector i belongs to cluster (i % clusters); a weighted
+    // blend of that cluster's center and per-vector noise, normalized. Realistic
+    // "few near, most far" structure (not a sensitivity knob; see verification
+    // note: i8@768 stays ~0.997 regardless; we lock that, not a forced band).
+    fn clustered_corpus(
+        n: usize,
+        dim: usize,
+        clusters: usize,
+        weight: f32,
+        seed0: u32,
+    ) -> Vec<Vec<f32>> {
+        let centers: Vec<Vec<f32>> =
+            (0..clusters).map(|c| det_unit(dim, 1_000 + c as u32)).collect();
+        (0..n)
+            .map(|i| {
+                let c = i % clusters;
+                let noise = pseudo_vec(dim, seed0 + i as u32);
+                let mut v: Vec<f32> = centers[c]
+                    .iter()
+                    .zip(noise.iter())
+                    .map(|(&ce, &no)| weight * ce + (1.0 - weight) * no)
+                    .collect();
+                normalize(&mut v);
+                v
+            })
+            .collect()
+    }
+
+    // Total order: score descending, then index ascending. Deterministic ties.
+    // Generic so it serves both the f64 ground truth and the f32 i8 ranking.
+    fn order_desc<T: PartialOrd>(a: &(usize, T), b: &(usize, T)) -> std::cmp::Ordering {
+        b.1.partial_cmp(&a.1)
+            .unwrap_or(std::cmp::Ordering::Equal)
+            .then(a.0.cmp(&b.0))
+    }
+
+    // True cosine of the ORIGINAL f32 vectors, accumulated in f64. f64 makes the
+    // top-k boundary gap >> any x86-vs-ARM f32 ULP jitter, so the recall ranking
+    // is cross-platform stable; also the reference for cosine fidelity.
+    fn cosine_f64_true(q: &[f32], t: &[f32]) -> f64 {
+        let mut dot = 0.0f64;
+        let mut qsq = 0.0f64;
+        let mut tsq = 0.0f64;
+        for (a, b) in q.iter().zip(t.iter()) {
+            let (a, b) = (*a as f64, *b as f64);
+            dot += a * b;
+            qsq += a * a;
+            tsq += b * b;
+        }
+        if qsq == 0.0 || tsq == 0.0 {
+            0.0
+        } else {
+            dot / (qsq.sqrt() * tsq.sqrt())
+        }
+    }
+```
+
+- [ ] **Step 2: Add the recall@k floor test (f64 ground truth)**
+
+Append into `mod tests`:
+
+```rust
+    #[test]
+    fn i8_topk_recall_matches_f32_within_floor() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const K: usize = 10;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from the measured baseline recall@10 = 0.996875 (deterministic:
+        // f64 GT + integer-exact i8 => bit-identical across x86/ARM). FLOOR =
+        // 0.9969 - 0.02 = 0.9769, rounded to 2dp = 0.98, margin ~0.017 (~5 hits
+        // of 320). The const guard forbids a vacuous (<0.5) floor. Confirm in Step 4.
+        const MIN_RECALL: f32 = 0.98;
+        const _: () = assert!(MIN_RECALL >= 0.5, "MIN_RECALL must be a real floor");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut recall_sum = 0.0f32;
+        for query in &queries {
+            // f64 ground-truth top-K (f64 removes x86/ARM ULP boundary jitter).
+            let mut gt_scores: Vec<(usize, f64)> = corpus
+                .iter()
+                .enumerate()
+                .map(|(i, c)| (i, cosine_f64_true(query, c)))
+                .collect();
+            gt_scores.sort_by(order_desc);
+            let gt: std::collections::HashSet<usize> =
+                gt_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            // i8 top-K (shipped kernel) with the identical total order.
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            let mut i8_scores: Vec<(usize, f32)> = corpus_blob
+                .iter()
+                .enumerate()
+                .map(|(i, blob)| (i, cosine_with_query_norm_i8_blob(&qi, qn_i8, blob)))
+                .collect();
+            i8_scores.sort_by(order_desc);
+            let got: std::collections::HashSet<usize> =
+                i8_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            recall_sum += gt.intersection(&got).count() as f32 / K as f32;
+        }
+        let recall = recall_sum / Q as f32;
+        println!("PR6 recall@{K} (N={N} Q={Q} dim={DIM} clusters={CLUSTERS}) = {recall}");
+        assert!(
+            recall >= MIN_RECALL,
+            "i8 recall@{K} regressed: {recall} < {MIN_RECALL}"
+        );
+    }
+```
+
+- [ ] **Step 3: Add the cosine-fidelity backstop test (deterministic, sensitive)**
+
+Append into `mod tests`:
+
+```rust
+    #[test]
+    fn i8_cosine_fidelity_vs_true_f32() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from the measured max error 0.00121 (deterministic: i8 dot
+        // integer-exact, GT in f64 => ~1e-12 platform jitter). 0.005 ~= 4x the
+        // baseline: sensitive to a lossier future quantizer yet never flaky.
+        // The const guard forbids a vacuous (>=0.1) bound. Confirm in Step 4.
+        const MAX_COS_ERR: f64 = 0.005;
+        const _: () = assert!(MAX_COS_ERR < 0.1, "MAX_COS_ERR must be a real bound");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut max_err = 0.0f64;
+        for query in &queries {
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            for (c, blob) in corpus.iter().zip(corpus_blob.iter()) {
+                let i8c = cosine_with_query_norm_i8_blob(&qi, qn_i8, blob) as f64;
+                let truec = cosine_f64_true(query, c);
+                let e = (i8c - truec).abs();
+                if e > max_err {
+                    max_err = e;
+                }
+            }
+        }
+        println!("PR6 max|cosine_i8 - cosine_f32_true| (N={N} Q={Q} dim={DIM}) = {max_err}");
+        assert!(
+            max_err <= MAX_COS_ERR,
+            "i8 cosine fidelity regressed: max err {max_err} > {MAX_COS_ERR}"
+        );
+    }
+```
+
+- [ ] **Step 4: Run & confirm the (pre-measured, deterministic) baselines**
+
+The thresholds above are already locked from an empirical planning-time run (macOS arm64). Confirm they hold; the metrics are deterministic (f64 GT + integer-exact i8), so they should match bit-for-bit:
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1 --nocapture`
+Expected: all `vector_quant` tests PASS and print:
+- `PR6 recall@10 (...) = 0.996875` (gate `MIN_RECALL=0.98` → pass, margin ~0.017)
+- `PR6 max|cosine_i8 - cosine_f32_true| (...) = 0.00121...` (gate `MAX_COS_ERR=0.005` → pass, ~4× margin)
+
+NOTE: `cargo test` takes ONE positional substring filter, so `"a|b|c"` is matched literally (0 tests). Use the module substring `vector_quant` (runs all 7) as above.
+
+If your measured recall `X` or max error `M` differs materially (it shouldn't, since both are deterministic), recompute `MIN_RECALL = X − 0.02, rounded to 2dp` and `MAX_COS_ERR ≈ 4 × M` (keep the const guards satisfied) and note the deviation in PR6.md.
+
+- [ ] **Step 5: Prove both gates have teeth**
+
+Temporarily set `MIN_RECALL = 0.999` → recall test FAILs (0.996875 < 0.999); revert to `0.98`.
+Temporarily set `MAX_COS_ERR = 1e-9` → fidelity test FAILs; revert to `0.005`.
+Re-run Step 4 → both PASS.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add rust_builder/rust/src/api/vector_quant.rs
+git commit -m "test(vector_quant): i8 recall@10 floor + cosine-fidelity gates vs f64 truth (LOC-64)"
+```
+
+---
+
+## Task 3: i8 microbench + scan bench
+
+Expose the i8 kernel to the bench crate and add an i8 microbench + an i8 scan bench (shipped hot loop). When `vector_quant_i8` is off, the i8 bench fns compile as no-op stubs so `criterion_group!` is feature-agnostic.
+
+**Files:**
+- Modify: `rust_builder/rust/src/bench_api.rs` (append after line 32, before the `BACKEND` doc comment)
+- Modify: `rust_builder/rust/benches/vector_math.rs` (add fns + extend `targets`)
+
+- [ ] **Step 1: Add i8 re-export wrappers to `bench_api.rs`**
+
+Insert after line 32 (`}` of `decode_f32_embedding`), before the `BACKEND` doc comment:
+
+```rust
+#[cfg(feature = "vector_quant_i8")]
+use crate::api::vector_quant;
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn quantize_f32_to_i8(input: &[f32]) -> (Vec<i8>, f32) {
+    vector_quant::quantize_f32_to_i8(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn l2_norm_i8(v: &[i8]) -> f32 {
+    vector_quant::l2_norm_i8(v)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn i8_blob_from_slice(input: &[i8]) -> Vec<u8> {
+    vector_quant::i8_blob_from_slice(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn cosine_with_query_norm_i8_blob(query: &[i8], query_norm: f32, target_blob: &[u8]) -> f32 {
vector_quant::cosine_with_query_norm_i8_blob(query, query_norm, target_blob) +} +``` + +- [ ] **Step 2: Verify `bench_api` compiles under the i8 feature** + +Run: `cargo build --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_quant_i8"` +Expected: builds clean. (`api/mod.rs:29` declares `#[cfg(feature="vector_quant_i8")] pub(crate) mod vector_quant;` and the kernel fns are `pub`, so the crate-internal path `crate::api::vector_quant::*` resolves from `bench_api`. The `use` and wrappers share the same `vector_quant_i8` gate, so nothing dangles when the feature is off.) + +- [ ] **Step 3: Add i8 bench fns + extend targets in `benches/vector_math.rs`** + +Insert after `bench_scan` (line 107), before the `criterion_group!`: + +```rust +#[cfg(feature = "vector_quant_i8")] +fn bench_cosine_i8(c: &mut Criterion) { + let mut g = c.benchmark_group("cosine_i8"); + for &dim in &DIMS { + let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 1)); + let (ti, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 2)); + let qn = bench_api::l2_norm_i8(&qi); + let tblob = bench_api::i8_blob_from_slice(&ti); + g.throughput(Throughput::Elements(dim as u64)); + g.bench_with_input(BenchmarkId::from_parameter(dim), &dim, |b, _| { + b.iter(|| { + bench_api::cosine_with_query_norm_i8_blob( + black_box(&qi), + black_box(qn), + black_box(&tblob), + ) + }) + }); + } + g.finish(); +} +#[cfg(not(feature = "vector_quant_i8"))] +fn bench_cosine_i8(_c: &mut Criterion) {} + +// Shipped exact-scan inner loop: one query vs N candidate i8 blobs, scored with +// zero f32 decode / zero per-row alloc — the actual release hot path. 
+#[cfg(feature = "vector_quant_i8")]
+fn bench_scan_i8(c: &mut Criterion) {
+    let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 1));
+    let qn = bench_api::l2_norm_i8(&qi);
+    let blobs: Vec<Vec<u8>> = (0..SCAN_N)
+        .map(|i| {
+            let (vi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 100 + i as u32));
+            bench_api::i8_blob_from_slice(&vi)
+        })
+        .collect();
+
+    let mut g = c.benchmark_group("exact_scan_i8");
+    g.throughput(Throughput::Elements(SCAN_N as u64));
+    g.bench_function(BenchmarkId::new("i8_blob_cosine", SCAN_N), |b| {
+        b.iter(|| {
+            let mut best = f32::MIN;
+            for blob in &blobs {
+                let s = bench_api::cosine_with_query_norm_i8_blob(black_box(&qi), qn, black_box(blob));
+                if s > best {
+                    best = s;
+                }
+            }
+            black_box(best)
+        })
+    });
+    g.finish();
+}
+#[cfg(not(feature = "vector_quant_i8"))]
+fn bench_scan_i8(_c: &mut Criterion) {}
+```
+
+Then change the `criterion_group!` `targets` line (line 115) from:
+
+```rust
+    targets = bench_cosine, bench_dot, bench_decode, bench_scan
+```
+
+to:
+
+```rust
+    targets = bench_cosine, bench_dot, bench_decode, bench_scan, bench_cosine_i8, bench_scan_i8
+```
+
+- [ ] **Step 4: Run the shipped-tree bench (i8 + f32-faer side by side)**
+
+Run: `cargo bench --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_faer,vector_quant_i8" -- exact_scan`
+Expected: reports both `exact_scan[faer]` (f32 decode+cosine) and `exact_scan_i8/i8_blob_cosine` (shipped i8). Record both medians + the i8/f32 ratio. (Group names `cosine_i8`/`exact_scan_i8` carry the `_i8` suffix to distinguish from the f32 groups, which is the §3 "distinguish f32 vs i8" intent.)
+
+Also run the i8 microbench: `cargo bench --manifest-path rust_builder/rust/Cargo.toml --features "bench,vector_faer,vector_quant_i8" -- cosine_i8` and record per-dim numbers.
+ +- [ ] **Step 5: Verify the no-op stubs compile with the feature OFF** + +Run: `cargo build --manifest-path rust_builder/rust/Cargo.toml --features "bench"` +Expected: builds clean (i8 bench fns are no-op stubs; `criterion_group!` still references them). + +- [ ] **Step 6: Commit** + +```bash +git add rust_builder/rust/src/bench_api.rs rust_builder/rust/benches/vector_math.rs +git commit -m "bench(vector_math): add i8 hot-kernel + i8 scan benches (LOC-64)" +``` + +--- + +## Task 4: CI fail-closed gate on the shipped i8 tree + +Add an i8 test step to `scripts/test_ci.sh`, mirroring the existing faer step: run the `vector_quant` tests (ε + recall + fidelity nets) on the **shipped** `vector_faer,vector_quant_i8` tree. Fail closed on zero matches AND if any **named** net is missing (a broad-filter + N≥1 guard alone would stay green on the 4 legacy tests if a net were renamed/cfg-excluded). + +**Files:** +- Modify: `scripts/test_ci.sh` (`native` case, after the faer `vector_math` block ending at line 50, before the `# Compile-check the actual shipped feature combo` comment at line 51) + +- [ ] **Step 1: Insert the i8 test step** + +After line 50 (the faer block's closing `fi`), before line 51's comment, insert: + +```bash + echo "[ci] Running i8 quant kernels + ε/recall/fidelity safety nets on the SHIPPED faer+quant tree" + # The shipped per-candidate hot path is i8 (cosine_with_query_norm_i8_blob), + # not the f32 faer kernels. Run the vector_quant tests on the exact shipped + # feature combo and fail closed on zero matches. + if ! quant_out="$(cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1 2>&1)"; then + echo "$quant_out" + echo "[ci] ERROR: i8 vector_quant tests failed" >&2 + exit 1 + fi + echo "$quant_out" + if ! grep -Eq 'test result: ok\. 
[1-9][0-9]* passed' <<<"$quant_out"; then
+        echo "[ci] ERROR: i8 vector_quant matched 0 tests (renamed/cfg-excluded?); failing closed" >&2
+        exit 1
+    fi
+    # Fail closed if any specific safety net was renamed/cfg-excluded (a broad
+    # filter + N>=1 alone would stay green on the legacy i8 tests).
+    for net in i8_blob_cosine_matches_independent_reference \
+        i8_topk_recall_matches_f32_within_floor \
+        i8_cosine_fidelity_vs_true_f32; do
+        if ! grep -Eq "${net} .* ok" <<<"$quant_out"; then
+            echo "[ci] ERROR: i8 safety net '${net}' did not run/pass (renamed/cfg-excluded?); failing closed" >&2
+            exit 1
+        fi
+    done
+```
+
+- [ ] **Step 2: Run the inserted command directly (fast local check)**
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1`
+Expected: `test result: ok. N passed` with N ≥ 7 (4 legacy + ε + recall + fidelity = 7), and the output contains `... ok` lines for all three named nets.
+
+(Full `./scripts/test_ci.sh native` also runs flutter/PDF steps that need the local toolchain; if unavailable, the direct command above is the meaningful check for this task.)
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add scripts/test_ci.sh
+git commit -m "ci(vector_quant): run i8 ε/recall/fidelity nets on shipped faer+quant tree (LOC-64)"
+```
+
+---
+
+## Task 5: Journal — PR6.md + README status row
+
+**Files:**
+- Create: `docs/perf/vector-math-refactor/PR6.md`
+- Modify: `docs/perf/vector-math-refactor/README.md` (status table)
+
+- [ ] **Step 1: Create `PR6.md` with the measured results**
+
+Create `docs/perf/vector-math-refactor/PR6.md` (fill `<...>` from Task 2 Step 4 and Task 3 Step 4):
+
+```markdown
+# PR6: measuring the shipped i8 hot path + ε/recall/fidelity safety net (N: measure first)
+
+- Branch: `feat/loc-64-i8-measure-parity-net`
+- Linear: [LOC-64](https://linear.app/loceract/issue/LOC-64)
+- Status: 🟦 In progress (PR open, waiting for CI green)
+- Design: [PR6-spec-i8-measure-parity-net.md](PR6-spec-i8-measure-parity-net.md)
+
+## Scope (non-destructive: 0 lines of kernel/quantization code changed)
+Applies the PR1 pattern to the shipped hot path (i8 `cosine_with_query_norm_i8_blob`): measurement + numeric ε net + recall@k floor + cosine fidelity net + CI fail-closed.
+
+## Results (measured)
+- **i8 hot-kernel microbench** (per dim, ns): 384=<...> / 768=<...> / 1024=<...> / 1536=<...>
+- **Scan (2000×768) comparison**: `exact_scan[faer]` (f32 decode+cosine)=<...> µs vs `exact_scan_i8` (i8 blob)=<...> µs → i8 is **<...>×** relative to f32-faer.
+- **Numeric ε net**: kernel ≈ f64 reference at dims {1,2,3,16,384,768,1024,1536}, ε=1e-4 green.
+- **Key finding**: i8 per-vector quantization gives **recall@10 ≈ 0.997** at 768d (=319/320, nearly lossless). The 'sensitive band' is unreachable under the shipped settings, and forcing it would be unrepresentative. The gate therefore locks this high baseline.
+- **recall@k floor net**: N=2000, Q=32, dim=768, k=10, clusters=16 → measured recall@10 = **0.996875** (dev arm64, CI-confirmed), FLOOR = **0.98** (= X−0.02 rounded to 2dp). GT is f64 (removes platform jitter), total order `(score desc, index asc)`.
+- **Cosine fidelity net**: `max|cosine_i8 − cosine_f32_true|` = **0.00121**, gate = **0.005** (≈4× baseline). Fully deterministic and sensitive.
+- **CI**: `--features "vector_quant_i8,vector_faer" -- --test-threads=1` fail-closed + per-name guards for the 3 nets (N=<...> passed).
+
+## Feedback received (review / pre-flight verification)
+- Design review: N=2000/k=10, total-order tie-break, CI on the shipped tree (faer+quant).
+- **Caught by pre-implementation adversarial verification**: recall@10 saturates at 768d (~0.997), so a 'sensitive band' is not achievable → redesigned as **recall floor + cosine fidelity backstop**; 1-ULP boundary jitter in the f32 GT → **f64 GT**; risk of a vacuous floor → `const _` compile-time guard + per-name CI guards.
+
+## Risks / rollback
+- Non-destructive (0 kernel lines) → no behavior change. Rollback: revert the PR.
+- Determinism: integer-exact i8 dot + f64 GT → platform-independent. Fidelity does not depend on ranking (zero boundary jitter).
+- Vacuous gate: blocked at compile time by a `const _: () = assert!(...)` guard + per-name fail-closed checks in CI.
+
+## Decision log
+- Confirmed that the shipped hot path is i8 (previous session) → moved the measurement/verification focus from f32 (fallback) to i8.
+- Quality gates derive FLOOR/ε_q from the measured baseline (measure first). The corpus keeps the shipped configuration (768d, per-vector); it is not forced into a sensitive regime.
+```
+
+- [ ] **Step 2: Add the PR6 row to `README.md`**
+
+In `docs/perf/vector-math-refactor/README.md`, add after the PR5 row line:
+
+```markdown
+| PR6 | i8 shipped hot-path **measurement + ε/recall/fidelity safety net** | i8 verification gap | Low (non-destructive) | main(#67) | [LOC-64](https://linear.app/loceract/issue/LOC-64) | 🟦 In progress ([PR6.md](PR6.md)) |
+```
+
+- [ ] **Step 3: Commit**
+
+```bash
+git add docs/perf/vector-math-refactor/PR6.md docs/perf/vector-math-refactor/README.md
+git commit -m "docs(perf): PR6 journal entry + status row, i8 measure/parity results (LOC-64)"
+```
+
+---
+
+## Task 6: Full verification + open PR (stop at CI green)
+
+**Files:** none (verification + PR)
+
+- [ ] **Step 1: Full shipped-tree test run (all nets)**
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" -- --test-threads=1`
+Expected: `test result: ok.` with the ε + recall + fidelity tests among the passed set, 0 failed.
+
+- [ ] **Step 2: Confirm non-shipped trees still build/test (no regressions)**
+
+Run: `cargo test --manifest-path rust_builder/rust/Cargo.toml --lib -- --test-threads=1` (default features)
+Expected: `test result: ok.` (vector_quant tests are cfg-excluded here, which is fine; the i8 nets only run under the feature).
+Run: `cargo build --manifest-path rust_builder/rust/Cargo.toml --features "bench"` and `--features "bench,vector_faer,vector_quant_i8"`
+Expected: both build clean (stub + real i8 benches).
+
+- [ ] **Step 3: Push branch and open PR**
+
+```bash
+git push -u origin feat/loc-64-i8-measure-parity-net
+gh pr create --base main --head feat/loc-64-i8-measure-parity-net \
+  --title "PR6 — i8 hot-path measure + ε/recall/fidelity safety net (LOC-64)" \
+  --body "$(cat <<'BODY'
+Applies the PR1 "measure before changing" pattern to the SHIPPED int8 hot path (every prior review/bench scrutinized only the f32 fallback). Non-destructive — kernels unchanged.
+
+- **Measure**: i8 micro + i8 scan benches (vs f32-faer). Numbers in PR6.md.
+- **Numeric ε net**: i8 cosine kernel ≈ independent f64 reference, ε=1e-4.
+- **recall@k floor**: top-k(i8) vs top-k(f32, f64 ground truth) recall@10 ≥ measured baseline − margin. (Finding: i8@768 is ~lossless for recall@10.)
+- **cosine fidelity**: max|cosine_i8 − cosine_f32_true| ≤ measured bound — deterministic, sensitive backstop.
+- **CI fail-closed**: nets run on the shipped `vector_faer,vector_quant_i8` tree, with per-net name guards.
+
+Spec: docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md · Journal: PR6.md
+BODY
+)"
+```
+
+- [ ] **Step 4: Watch CI to green; hand off for user merge**
+
+Run: `gh pr checks --watch` (re-poll on transient network error).
+Expected: all checks pass. **Do NOT merge**: report "PR opened, CI green" and let the user merge. After merge: PR6.md status → 🟩, README PR6 row → 🟩 (follow-up).
+
+---
+
+## Self-Review (filled by plan author)
+
+- **Spec coverage**: §3 bench → Task 3; §4 ε net → Task 1; §5 quality net → Task 2 (recall floor + fidelity, per the approved Net-2 redesign); §6 CI → Task 4; §7 tracking → Task 5 + issue created; §8 acceptance → Tasks 1–6; non-goal (0 kernel changes) → only `mod tests`, `bench_api`, `benches`, CI, docs touched.
✅ (Spec §5/§8/§9/§10 updated to the recall-floor + fidelity design.)
+- **Placeholders**: `MIN_RECALL`/`MAX_COS_ERR`/bench numbers are *measure-first outputs* with exact derivation + compile-time `const _` guards (Task 2 Steps 4–5), not vague TODOs. PR6.md `<...>` are explicitly "fill from measured results." ✅
+- **Type consistency**: `order_desc` used on both `(usize,f64)` (GT) and `(usize,f32)` (i8); `cosine_f64_true`/`ref_cosine_i8_f64`/`clustered_corpus`/`quantize_f32_to_i8`/`l2_norm_i8`/`i8_blob_from_slice`/`cosine_with_query_norm_i8_blob` match real `vector_quant.rs` signatures (verified). bench_api wrappers match. ✅
+- **Verification-driven fixes applied**: f64 GT (jitter), recall-floor not forced-band (saturation), const guards + CI name guards (vacuous gate), `pub(crate) mod` wording, flat `X−0.02` floor. ✅
diff --git a/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
index 7404db5..2dc97fe 100644
--- a/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
+++ b/docs/perf/vector-math-refactor/PR6-spec-i8-measure-parity-net.md
@@ -44,22 +44,28 @@
 - **Rationale for ε**: the kernel's `dot_i8_i32`/`sq_sum` accumulate in i32 integers and are therefore **exact** (at dim 1536 the maximum is ~2.5e7 ≪ i32 max 2.1e9, so no overflow). The only sources of floating-point error are the final `(sq_sum as f32).sqrt()` and the division by `query_norm` (f32). `1e-4` is a reasonable bound that tolerates cross-platform error in that cast while still catching logic bugs (SIMD rewrites, indexing, norm errors).
 - Difference from the existing tests: those only check blob↔slice **entry-point agreement**. This one checks *the math itself* against an independent implementation → catches **bugs in future i8 kernel rewrites**.
 
-## 5. Component 3: recall@k net (quantization quality) ★Core
+## 5. Component 3: quantization quality net (recall@k floor + cosine fidelity) ★Core
+
+> ⚠️ **Finding from pre-implementation adversarial verification (2026-05-31):** i8 per-vector quantization is so accurate at 768d that it barely reorders the top 10: recall@10 ≈ 0.997. The 'sensitive band (0.85~0.98)' is **unreachable** under the shipped configuration, and forcing it (low dimension, global scale) would no longer reflect the shipped path. Therefore (a) recall **locks in the measured high baseline** (no forced band), and (b) **cosine fidelity** is added as the genuinely sensitive, deterministic gate.
 
 - Location: `vector_quant.rs` test module, `#[cfg(feature = "vector_quant_i8")]`.
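-- **Synthetic clustered corpus** (deterministic, no rand dependency, `pseudo_vec`-style seeded generator):
-  - C cluster centers (unit vectors) + Gaussian-like noise → normalize. Mimics an embedding distribution where "a few are close and most are far".
-  - Defaults: **N=2000** candidates (reusing `SCAN_N`), **Q=32** queries, **dim=768**, **k=10** (recall@10, top 0.5%).
-- **Ground truth**: each query's **f32 cosine top-k** (on the f32 corpus = the true ranking).
-- **i8 ranking**: quantize corpus and queries to i8, then take the **i8 cosine top-k**.
-- **Total-order comparator (required)**: sort **both** i8 and f32 by the total order `(score descending, index ascending)`. i8 scores come from i32 integers, so **ties are frequent** (a consequence of the kernel's accumulation structure); without an explicit index tie-break, differences in sort implementations across platforms (Ubuntu vs macOS) make the test **flaky**. Applying the same comparator means only genuine quantization reordering shows up in recall.
-- **Metric**: `recall@k = |topk_i8 ∩ topk_f32| / k`, averaged over queries. Fixed seeds → **fully reproducible (non-statistical, non-flaky)**.
-- **Threshold = a product of measuring first**:
-  1. The first run of the implementation measures the actual `recall@10`.
-  2. **Saturation check**: if the measurement is ~1.0 the gate is decorative → raise cluster density/noise/overlap and recalibrate the corpus until recall falls in the **sensitive range (0.85~0.98)**.
-  3. Fix the CI gate at `recall@10 ≥ FLOOR` (FLOOR = measured value − margin, margin ≈ 0.03).
-  4. Record the measured value, FLOOR, and corpus parameters in `PR6.md`.
-- What we gain: CI goes red if quantization breaks nearest-neighbor ranking, which is exactly the safety net that is missing today. The measurement produces the threshold, and the threshold becomes the regression gate.
+- **Synthetic clustered corpus** (deterministic, no rand): C cluster centers + noise → normalize. **Keeps the shipped configuration** (N=2000, Q=32, dim=768, per-vector scale). The corpus is only a realistic distribution, not a sensitivity knob (clusters=16, fixed).
+
+### 5a. recall@k floor
+- **Ground truth (GT)**: each query's **f64 cosine top-k** (the original f32 values accumulated in f64 → the boundary gap ≫ x86/ARM 1-ULP jitter, so it is **platform-stable**). k=10.
+- **i8 ranking**: top-k from the i8 kernel after i8 quantization.
+- **Total-order comparator**: `(score desc, index asc)` on both sides, a deterministic tie-break (i8 integer scores tie frequently → resolved by index).
+- **Gate**: `recall@10 ≥ FLOOR`, **FLOOR = floor(baseline − 0.02) = 0.98** (measured baseline 0.996875 = 319/320, dev arm64, deterministic). The forced band and saturation guard are dropped: a baseline of ~1.0 is both the reality and a good result.
+
+### 5b. Cosine fidelity (deterministic, sensitive backstop)
+- Measure `max|cosine_i8 − cosine_f32_true(f64)|` over all (query, candidate) pairs.
+- **Gate**: `max ≤ ε_q`, **ε_q ≈ 4 × measured max = 0.005** (measured max 0.00121). Independent of ranking → zero boundary jitter, and the most sensitive to quantization quality degradation.
+
+### Measure first → thresholds
+1. The first run (planning, dev arm64) measures recall@10 = X = 0.996875 and max fidelity error = M = 0.00121.
+2. `FLOOR = floor(X − 0.02) = 0.98` and `ε_q ≈ 4·M = 0.005` are **fixed in advance** (deterministic, so the same values on CI and every platform).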
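A `const _: () = assert!(...)` **blocks** vacuous thresholds (floor <0.5 / ε_q ≥0.1) **at compile time**.
+3. Record X, M, FLOOR, and ε_q in `PR6.md`, and confirm the same values at implementation time.
+- What we gain: CI goes red if a future i8/quantization change degrades search quality. recall covers coarse regressions, fidelity covers fine-grained ones.
 
 ## 6. Component 4: CI gating (fail-closed)
 
@@ -80,31 +86,32 @@
 
 - [ ] i8 microbench + i8 scan bench run, with numbers recorded in `PR6.md` (i8 throughput + i8 vs f32-faer ratio).
 - [ ] Numeric ε net: i8 kernel ≈ f64 reference within `<1e-4` per dimension, green.
-- [ ] recall@k net: deterministic, corpus sits in the sensitive range (0.85~0.98), `recall@10 ≥ FLOOR` green, FLOOR/measured value recorded.
-- [ ] CI runs the i8 tests fail-closed with `--features "vector_quant_i8,vector_faer" -- --test-threads=1`, with no regressions in existing jobs.
+- [ ] recall@k floor: deterministic (f64 GT), `recall@10 ≥ FLOOR(=baseline−0.02)` green, baseline/FLOOR recorded.
+- [ ] Cosine fidelity: `max|cosine_i8−cosine_f32_true| ≤ ε_q` green, measured max/ε_q recorded.
+- [ ] CI `--features "vector_quant_i8,vector_faer" -- --test-threads=1` fail-closed + per-name guards for the 3 nets, with no regressions in existing jobs.
 - [ ] Confirmed 0 lines of kernel/quantization code changed (non-destructive).
 
 ## 9. Risks / mitigations
 
 | Risk | Mitigation |
 |---|---|
-| recall gate saturation (false reassurance) | §5-2 saturation check + fix FLOOR after calibrating corpus sensitivity |
-| Cross-platform flakiness from i8 ties | §5 total-order comparator `(score, index)` enforced on both sides |
+| recall baseline ~1.0 (coarse, insensitive) | Fine-grained regressions are covered by the cosine fidelity backstop (ε_q) |
+| 1-ULP boundary jitter in an f32 GT | Compute GT in **f64** → boundary gap ≫ jitter, recall is platform-stable |
+| Vacuous gate (uncalibrated threshold) | `const _ assert` compile-time guard + per-name fail-closed CI checks for the 3 nets |
 | Compile conflicts between feature combinations | §6 CI tests on the shipped tree (faer+quant) |
-| Conflict with #67 in the README PR table | §7 branch after #67 merges (avoid stacking) |
+| Conflict with #67 in the README PR table | Resolved (#67 merged, branched from `main`) |
 | ε too tight/loose | Integer dot is exact, only sqrt/div contribute error → 1e-4 is mathematically reasonable (confirmed in review) |
 
 ## 10.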
Tunable defaults (stated in the spec, adjustable during implementation)
 
 | Parameter | Default | Notes |
 |---|---|---|
-| ε (numeric net) | `1e-4` | Integer dot is exact, only sqrt/div error allowed |
+| ε (kernel ε net) | `1e-4` | Integer dot is exact, only sqrt/div error |
 | k (recall) | `10` | recall@10 |
-| N (corpus) | `2000` | Reuses `SCAN_N`, top 0.5% |
-| Q (queries) | `32` | |
-| dim | `768` | Representative dimension of shipped embeddings |
-| recall margin | `measured − 0.03` | FLOOR finalized after the first measurement |
-| Cluster count/noise | Calibrated by measurement | Targets the recall 0.85~0.98 sensitive range |
+| N / Q / dim | `2000 / 32 / 768` | Shipped configuration (per-vector scale); N reuses `SCAN_N` |
+| recall FLOOR | `floor(X−0.02) = 0.98` | Measured X 0.996875; const guard `≥0.5` |
+| fidelity ε_q | `≈4·M = 0.005` | Measured max 0.00121; const guard `<0.1` |
+| Cluster count | `16` (fixed) | Realistic distribution; not a sensitivity knob |
 
 ---

From d8bb71f101ade943de613b52a874ac1f4702f510 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 04:54:43 +0900
Subject: [PATCH 3/7] =?UTF-8?q?test(vector=5Fquant):=20i8=20cosine=20kerne?=
 =?UTF-8?q?l=20=CE=B5-parity=20vs=20independent=20f64=20reference=20(LOC-6?=
 =?UTF-8?q?4)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 rust_builder/rust/src/api/vector_quant.rs | 56 +++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/rust_builder/rust/src/api/vector_quant.rs b/rust_builder/rust/src/api/vector_quant.rs
index b1dfde3..69ce297 100644
--- a/rust_builder/rust/src/api/vector_quant.rs
+++ b/rust_builder/rust/src/api/vector_quant.rs
@@ -195,4 +195,60 @@ mod tests {
             assert_eq!(direct_blob, two_step_blob);
         }
     }
+
+    // --- PR6 shared test helpers (deterministic, no rand dep) ---
+
+    // Same generator as benches/vector_math.rs: reproducible run-to-run.
+    fn pseudo_vec(dim: usize, seed: u32) -> Vec<f32> {
+        (0..dim)
+            .map(|i| {
+                let x = (i as u32)
+                    .wrapping_mul(2_654_435_761)
+                    .wrapping_add(seed.wrapping_mul(40_503));
+                ((x % 1000) as f32 / 1000.0) - 0.5
+            })
+            .collect()
+    }
+
+    // Independent f64 reference cosine of two i8 vectors.
Different accumulation
+    // width (i64) and float precision (f64) than the i32->f32 kernel, so a match
+    // proves the kernel math, not just that it agrees with itself.
+    fn ref_cosine_i8_f64(q: &[i8], t: &[i8]) -> f64 {
+        if q.len() != t.len() || q.is_empty() {
+            return 0.0;
+        }
+        let mut dot: i64 = 0;
+        let mut qsq: i64 = 0;
+        let mut tsq: i64 = 0;
+        for (&a, &b) in q.iter().zip(t.iter()) {
+            dot += (a as i64) * (b as i64);
+            qsq += (a as i64) * (a as i64);
+            tsq += (b as i64) * (b as i64);
+        }
+        if qsq == 0 || tsq == 0 {
+            return 0.0;
+        }
+        (dot as f64) / ((qsq as f64).sqrt() * (tsq as f64).sqrt())
+    }
+
+    #[test]
+    fn i8_blob_cosine_matches_independent_reference() {
+        // Integer dot/sq are exact; only the final f32 sqrt+div can drift.
+        const EPS: f64 = 1e-4;
+        for &dim in &[1usize, 2, 3, 16, 384, 768, 1024, 1536] {
+            let q = pseudo_vec(dim, 7);
+            let t = pseudo_vec(dim, 9);
+            let (qi, _) = quantize_f32_to_i8(&q);
+            let (ti, _) = quantize_f32_to_i8(&t);
+            let blob = i8_blob_from_slice(&ti);
+            let qn = l2_norm_i8(&qi);
+
+            let kernel = cosine_with_query_norm_i8_blob(&qi, qn, &blob) as f64;
+            let reference = ref_cosine_i8_f64(&qi, &ti);
+            assert!(
+                (kernel - reference).abs() < EPS,
+                "i8 cosine dim={dim}: kernel={kernel} ref={reference}"
+            );
+        }
+    }
 }

From 012d110cbb7997e46052652e55659c3a821a7972 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:00:19 +0900
Subject: [PATCH 4/7] test(vector_quant): i8 recall@10 floor + cosine-fidelity gates vs f64 truth (LOC-64)

---
 rust_builder/rust/src/api/vector_quant.rs | 163 ++++++++++++++++++++++
 1 file changed, 163 insertions(+)

diff --git a/rust_builder/rust/src/api/vector_quant.rs b/rust_builder/rust/src/api/vector_quant.rs
index 69ce297..f0e17f8 100644
--- a/rust_builder/rust/src/api/vector_quant.rs
+++ b/rust_builder/rust/src/api/vector_quant.rs
@@ -251,4 +251,167 @@ mod tests {
             );
         }
     }
+
+    // --- PR6 Task 2 helpers ---
+
+    fn normalize(v: &mut [f32]) {
+        let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
+        if n > 0.0 {
+            for x in v.iter_mut() {
+                *x /= n;
+            }
+        }
+    }
+
+    fn det_unit(dim: usize, seed: u32) -> Vec<f32> {
+        let mut v = pseudo_vec(dim, seed);
+        normalize(&mut v);
+        v
+    }
+
+    // Clustered corpus: vector i belongs to cluster (i % clusters); a weighted
+    // blend of that cluster's center and per-vector noise, normalized.
+    fn clustered_corpus(
+        n: usize,
+        dim: usize,
+        clusters: usize,
+        weight: f32,
+        seed0: u32,
+    ) -> Vec<Vec<f32>> {
+        let centers: Vec<Vec<f32>> =
+            (0..clusters).map(|c| det_unit(dim, 1_000 + c as u32)).collect();
+        (0..n)
+            .map(|i| {
+                let c = i % clusters;
+                let noise = pseudo_vec(dim, seed0 + i as u32);
+                let mut v: Vec<f32> = centers[c]
+                    .iter()
+                    .zip(noise.iter())
+                    .map(|(&ce, &no)| weight * ce + (1.0 - weight) * no)
+                    .collect();
+                normalize(&mut v);
+                v
+            })
+            .collect()
+    }
+
+    // Total order: score descending, then index ascending. total_cmp gives a
+    // provably total order (NaN-safe), so sort output is platform-deterministic.
+    fn order_desc_f64(a: &(usize, f64), b: &(usize, f64)) -> std::cmp::Ordering {
+        b.1.total_cmp(&a.1).then(a.0.cmp(&b.0))
+    }
+    fn order_desc_f32(a: &(usize, f32), b: &(usize, f32)) -> std::cmp::Ordering {
+        b.1.total_cmp(&a.1).then(a.0.cmp(&b.0))
+    }
+
+    // True cosine of the ORIGINAL f32 vectors, accumulated in f64 (boundary gap
+    // >> x86/ARM ULP jitter); also the reference for cosine fidelity.
+    fn cosine_f64_true(q: &[f32], t: &[f32]) -> f64 {
+        let mut dot = 0.0f64;
+        let mut qsq = 0.0f64;
+        let mut tsq = 0.0f64;
+        for (a, b) in q.iter().zip(t.iter()) {
+            let (a, b) = (*a as f64, *b as f64);
+            dot += a * b;
+            qsq += a * a;
+            tsq += b * b;
+        }
+        if qsq == 0.0 || tsq == 0.0 {
+            0.0
+        } else {
+            dot / (qsq.sqrt() * tsq.sqrt())
+        }
+    }
+
+    #[test]
+    fn i8_topk_recall_matches_f32_within_floor() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const K: usize = 10;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from measured baseline recall@10 = 0.996875 (deterministic:
+        // f64 GT + integer-exact i8 => bit-identical across x86/ARM). FLOOR =
+        // floor(0.9969 - 0.02) = 0.98, margin ~0.017 (~5 hits of 320).
+        const MIN_RECALL: f32 = 0.98;
+        const _: () = assert!(MIN_RECALL >= 0.9, "MIN_RECALL must be a real floor");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut recall_sum = 0.0f32;
+        for query in &queries {
+            let mut gt_scores: Vec<(usize, f64)> = corpus
+                .iter()
+                .enumerate()
+                .map(|(i, c)| (i, cosine_f64_true(query, c)))
+                .collect();
+            gt_scores.sort_by(order_desc_f64);
+            let gt: std::collections::HashSet<usize> =
+                gt_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            let mut i8_scores: Vec<(usize, f32)> = corpus_blob
+                .iter()
+                .enumerate()
+                .map(|(i, blob)| (i, cosine_with_query_norm_i8_blob(&qi, qn_i8, blob)))
+                .collect();
+            i8_scores.sort_by(order_desc_f32);
+            let got: std::collections::HashSet<usize> =
+                i8_scores.iter().take(K).map(|(i, _)| *i).collect();
+
+            recall_sum += gt.intersection(&got).count() as f32 / K as f32;
+        }
+        let recall = recall_sum / Q as f32;
+        println!("PR6 recall@{K} (N={N} Q={Q} dim={DIM}
clusters={CLUSTERS}) = {recall}");
+        assert!(
+            recall >= MIN_RECALL,
+            "i8 recall@{K} regressed: {recall} < {MIN_RECALL}"
+        );
+    }
+
+    #[test]
+    fn i8_cosine_fidelity_vs_true_f32() {
+        const N: usize = 2000;
+        const Q: usize = 32;
+        const DIM: usize = 768;
+        const CLUSTERS: usize = 16;
+        const WEIGHT: f32 = 0.85;
+        // Locked from measured max error 0.00121 (deterministic). 0.005 ~= 4x the
+        // baseline: sensitive to a lossier future quantizer yet never flaky.
+        const MAX_COS_ERR: f64 = 0.005;
+        const _: () = assert!(MAX_COS_ERR < 0.1, "MAX_COS_ERR must be a real bound");
+
+        let corpus = clustered_corpus(N, DIM, CLUSTERS, WEIGHT, 5_000);
+        let queries = clustered_corpus(Q, DIM, CLUSTERS, WEIGHT, 9_000);
+        let corpus_blob: Vec<Vec<u8>> = corpus
+            .iter()
+            .map(|v| i8_blob_from_slice(&quantize_f32_to_i8(v).0))
+            .collect();
+
+        let mut max_err = 0.0f64;
+        for query in &queries {
+            let (qi, _) = quantize_f32_to_i8(query);
+            let qn_i8 = l2_norm_i8(&qi);
+            for (c, blob) in corpus.iter().zip(corpus_blob.iter()) {
+                let i8c = cosine_with_query_norm_i8_blob(&qi, qn_i8, blob) as f64;
+                let truec = cosine_f64_true(query, c);
+                let e = (i8c - truec).abs();
+                if e > max_err {
+                    max_err = e;
+                }
+            }
+        }
+        println!("PR6 max|cosine_i8 - cosine_f32_true| (N={N} Q={Q} dim={DIM}) = {max_err}");
+        assert!(
+            max_err <= MAX_COS_ERR,
+            "i8 cosine fidelity regressed: max err {max_err} > {MAX_COS_ERR}"
+        );
+    }
 }

From 77c025057f817e52fe3e94c118a0b6f2db6cae7e Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:10:27 +0900
Subject: [PATCH 5/7] bench(vector_math): add i8 hot-kernel + i8 scan benches (LOC-64)

---
 rust_builder/rust/benches/vector_math.rs | 58 +++++++++++++++++++++++-
 rust_builder/rust/src/bench_api.rs | 27 +++++++++++
 2 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/rust_builder/rust/benches/vector_math.rs b/rust_builder/rust/benches/vector_math.rs
index e3b5d70..effc21e 100644
---
a/rust_builder/rust/benches/vector_math.rs
+++ b/rust_builder/rust/benches/vector_math.rs
@@ -106,12 +106,68 @@ fn bench_scan(c: &mut Criterion) {
     g.finish();
 }
 
+#[cfg(feature = "vector_quant_i8")]
+fn bench_cosine_i8(c: &mut Criterion) {
+    let mut g = c.benchmark_group("cosine_i8");
+    for &dim in &DIMS {
+        let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 1));
+        let (ti, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(dim, 2));
+        let qn = bench_api::l2_norm_i8(&qi);
+        let tblob = bench_api::i8_blob_from_slice(&ti);
+        g.throughput(Throughput::Elements(dim as u64));
+        g.bench_with_input(BenchmarkId::from_parameter(dim), &dim, |b, _| {
+            b.iter(|| {
+                bench_api::cosine_with_query_norm_i8_blob(
+                    black_box(&qi),
+                    black_box(qn),
+                    black_box(&tblob),
+                )
+            })
+        });
+    }
+    g.finish();
+}
+#[cfg(not(feature = "vector_quant_i8"))]
+fn bench_cosine_i8(_c: &mut Criterion) {}
+
+// Shipped exact-scan inner loop: one query vs N candidate i8 blobs, scored with
+// zero f32 decode / zero per-row alloc — the actual release hot path.
+#[cfg(feature = "vector_quant_i8")]
+fn bench_scan_i8(c: &mut Criterion) {
+    let (qi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 1));
+    let qn = bench_api::l2_norm_i8(&qi);
+    let blobs: Vec<Vec<u8>> = (0..SCAN_N)
+        .map(|i| {
+            let (vi, _) = bench_api::quantize_f32_to_i8(&pseudo_vec(SCAN_DIM, 100 + i as u32));
+            bench_api::i8_blob_from_slice(&vi)
+        })
+        .collect();
+
+    let mut g = c.benchmark_group("exact_scan_i8");
+    g.throughput(Throughput::Elements(SCAN_N as u64));
+    g.bench_function(BenchmarkId::new("i8_blob_cosine", SCAN_N), |b| {
+        b.iter(|| {
+            let mut best = f32::MIN;
+            for blob in &blobs {
+                let s = bench_api::cosine_with_query_norm_i8_blob(black_box(&qi), qn, black_box(blob));
+                if s > best {
+                    best = s;
+                }
+            }
+            black_box(best)
+        })
+    });
+    g.finish();
+}
+#[cfg(not(feature = "vector_quant_i8"))]
+fn bench_scan_i8(_c: &mut Criterion) {}
+
 criterion_group!
{
     name = benches;
     config = Criterion::default()
         .sample_size(30)
         .warm_up_time(Duration::from_millis(500))
         .measurement_time(Duration::from_secs(2));
-    targets = bench_cosine, bench_dot, bench_decode, bench_scan
+    targets = bench_cosine, bench_dot, bench_decode, bench_scan, bench_cosine_i8, bench_scan_i8
 }
 criterion_main!(benches);
diff --git a/rust_builder/rust/src/bench_api.rs b/rust_builder/rust/src/bench_api.rs
index d59f4cf..33c4c2d 100644
--- a/rust_builder/rust/src/bench_api.rs
+++ b/rust_builder/rust/src/bench_api.rs
@@ -31,6 +31,33 @@ pub fn decode_f32_embedding(blob: &[u8]) -> Option<Vec<f32>> {
     vector_math::decode_f32_embedding(blob)
 }
 
+#[cfg(feature = "vector_quant_i8")]
+use crate::api::vector_quant;
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn quantize_f32_to_i8(input: &[f32]) -> (Vec<i8>, f32) {
+    vector_quant::quantize_f32_to_i8(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn l2_norm_i8(v: &[i8]) -> f32 {
+    vector_quant::l2_norm_i8(v)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn i8_blob_from_slice(input: &[i8]) -> Vec<u8> {
+    vector_quant::i8_blob_from_slice(input)
+}
+
+#[cfg(feature = "vector_quant_i8")]
+#[inline]
+pub fn cosine_with_query_norm_i8_blob(query: &[i8], query_norm: f32, target_blob: &[u8]) -> f32 {
+    vector_quant::cosine_with_query_norm_i8_blob(query, query_norm, target_blob)
+}
+
 /// Which backend this build compiled (for labelling bench output).
 pub const BACKEND: &str = if cfg!(feature = "vector_faer") {
     "faer"

From b0b71747f81bb02efb42f1962f9fbc58055af965 Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:14:41 +0900
Subject: [PATCH 6/7] =?UTF-8?q?ci(vector=5Fquant):=20run=20i8=20=CE=B5/rec?=
 =?UTF-8?q?all/fidelity=20nets=20on=20shipped=20faer+quant=20tree=20(LOC-6?=
 =?UTF-8?q?4)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 scripts/test_ci.sh | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/scripts/test_ci.sh b/scripts/test_ci.sh
index 8a06532..81bc048 100755
--- a/scripts/test_ci.sh
+++ b/scripts/test_ci.sh
@@ -48,6 +48,30 @@ case "$TARGET" in
       echo "[ci] ERROR: faer vector_math matched 0 tests (renamed/cfg-excluded?); failing closed" >&2
       exit 1
     fi
+    echo "[ci] Running i8 quant kernels + ε/recall/fidelity safety nets on the SHIPPED faer+quant tree"
+    # The shipped per-candidate hot path is i8 (cosine_with_query_norm_i8_blob),
+    # not the f32 faer kernels. Run the vector_quant tests on the exact shipped
+    # feature combo and fail closed on zero matches.
+    if ! quant_out="$(cargo test --manifest-path rust_builder/rust/Cargo.toml --lib --features "vector_quant_i8,vector_faer" vector_quant -- --test-threads=1 2>&1)"; then
+      echo "$quant_out"
+      echo "[ci] ERROR: i8 vector_quant tests failed" >&2
+      exit 1
+    fi
+    echo "$quant_out"
+    if ! grep -Eq 'test result: ok\. [1-9][0-9]* passed' <<<"$quant_out"; then
+      echo "[ci] ERROR: i8 vector_quant matched 0 tests (renamed/cfg-excluded?); failing closed" >&2
+      exit 1
+    fi
+    # Fail closed if any specific safety net was renamed/cfg-excluded (a broad
+    # filter + N>=1 alone would stay green on the legacy i8 tests).
+    for net in i8_blob_cosine_matches_independent_reference \
+               i8_topk_recall_matches_f32_within_floor \
+               i8_cosine_fidelity_vs_true_f32; do
+      if ! grep -Eq "${net} \.\.\.
ok" <<<"$quant_out"; then
+        echo "[ci] ERROR: i8 safety net '${net}' did not run/pass (renamed/cfg-excluded?); failing closed" >&2
+        exit 1
+      fi
+    done
     # Compile-check the actual shipped feature combo (faer + i8 quant). A
     # default-feature release build would never cover the backend that ships.
     cargo build --manifest-path rust_builder/rust/Cargo.toml --release --features vector_faer,vector_quant_i8

From 1cf33b13448c397231d0ff6dc4b82a2ad37b048c Mon Sep 17 00:00:00 2001
From: "Brian.oh" <49855381+dev07060@users.noreply.github.com>
Date: Sun, 31 May 2026 05:19:45 +0900
Subject: [PATCH 7/7] docs(perf): PR6 journal entry + status row, i8 measure/parity results (LOC-64)

---
 docs/perf/vector-math-refactor/PR6.md | 31 ++++++++++++++++++++++++
 docs/perf/vector-math-refactor/README.md | 1 +
 2 files changed, 32 insertions(+)
 create mode 100644 docs/perf/vector-math-refactor/PR6.md

diff --git a/docs/perf/vector-math-refactor/PR6.md b/docs/perf/vector-math-refactor/PR6.md
new file mode 100644
index 0000000..2a3588e
--- /dev/null
+++ b/docs/perf/vector-math-refactor/PR6.md
@@ -0,0 +1,31 @@
+# PR6: i8 shipped hot-path measurement + ε/recall/fidelity safety net (N: measure first)
+
+- Branch: `feat/loc-64-i8-measure-parity-net`
+- Linear: [LOC-64](https://linear.app/loceract/issue/LOC-64)
+- Status: 🟦 In progress (PR open, waiting for CI green)
+- Design: [PR6-spec-i8-measure-parity-net.md](PR6-spec-i8-measure-parity-net.md) · Plan: [PR6-plan-i8-measure-parity-net.md](PR6-plan-i8-measure-parity-net.md)
+
+## Scope (non-destructive: 0 lines of kernel/quantization code changed)
+Applies the PR1 pattern to the shipped hot path (i8 `cosine_with_query_norm_i8_blob`): measurement + numeric ε net + recall@k floor + cosine fidelity net + fail-closed CI.
+
+## Results (measured, dev arm64)
+- **i8 hot-kernel microbench** (ns): 384=7.87 / 768=14.97 / 1024=21.12 / 1536=31.28
+- **Scan (2000×768)**: `exact_scan[faer]` (f32 decode+cosine) **452.82 µs** vs `exact_scan_i8` (i8 blob) **29.98 µs** → i8 is **≈15.1× faster** than f32-faer.
+- **Key finding**: the shipped i8 hot path is ~15× faster than the f32 fallback while keeping **recall@10 ≈ 0.997** (=319/320, nearly lossless): both fast and accurate.
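+- **Numeric ε net**: kernel ≈ independent f64 reference at dims {1,2,3,16,384,768,1024,1536}, ε=1e-4, green.
+- **recall@k floor net**: N=2000, Q=32, dim=768, k=10, clusters=16 → recall@10 = **0.996875**, FLOOR = **0.98**. Ground truth is computed in f64 (removes platform jitter); the total order `(score desc, index asc)` uses `total_cmp` (NaN-safe).
+- **Cosine fidelity net**: `max|cosine_i8 − cosine_f32_true|` = **0.00121**, gate **ε_q = 0.005** (≈4× baseline). Independent of ranking and fully deterministic.
+- **CI**: `--features "vector_quant_i8,vector_faer" -- --test-threads=1` fail-closed + per-name guards for the 3 nets, 7 passed.
+
+## Feedback received (review / pre-verification)
+- Caught by pre-implementation adversarial verification: recall@10 saturates at 768d (~0.997), so a 'sensitive band' is not achievable → redesigned as **recall floor + cosine fidelity backstop**; 1-ULP boundary jitter in the f32 GT → **f64 GT**; risk of a vacuous gate → `const _` compile-time guard + per-name CI guards.
+- Implementation review: replaced `order_desc`, which used `partial_cmp().unwrap_or(Equal)` (non-transitive under NaN), with concrete `total_cmp`-based helpers; tightened the per-net CI regex to `\.\.\. ok`.
+
+## Risks / rollback
+- Non-destructive (0 kernel lines) → no behavior change. Rollback: revert the PR.
+- Determinism: integer-exact i8 dot + f64 GT → platform-independent (measured values are bit-identical). Fidelity does not depend on ranking (zero boundary jitter).
+- Vacuous gate: `const _: () = assert!(...)` compile-time guard + per-name fail-closed checks in CI.
+
+## Decision log
+- Confirmed that the shipped hot path is i8 (previous session) → moved the measurement/verification focus from f32 (fallback) to i8.
+- Quality gates derive FLOOR (0.98) and ε_q (0.005) from the measured baseline (measure first). The corpus keeps the shipped configuration (768d, per-vector); it is not forced into a sensitive regime.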
diff --git a/docs/perf/vector-math-refactor/README.md b/docs/perf/vector-math-refactor/README.md
index 8c1fd94..679a201 100644
--- a/docs/perf/vector-math-refactor/README.md
+++ b/docs/perf/vector-math-refactor/README.md
@@ -29,6 +29,7 @@
 | PR3 | decode buffer reuse | Claim1 | Low~medium | bench/N3 gate | [LOC-61](https://linear.app/loceract/issue/LOC-61) | ❌ Dropped (f32 decode is not hot in the shipped i8 build, verified in code) |
 | PR4 | ~~multi-accumulator unroll~~ | — | — | — | [LOC-62](https://linear.app/loceract/issue/LOC-62) | ❌ Dropped (moot since faer is kept) |
 | PR5 | Hygiene: corruption logging (N6) + endianness documentation (N5) | N6, N5 | Low (independent) | — | [LOC-63](https://linear.app/loceract/issue/LOC-63) | 🟩 Merged (#66, [PR5.md](PR5.md)) |
+| PR6 | i8 shipped hot-path **measurement + ε/recall/fidelity safety net** | i8 verification gap | Low (non-destructive) | main(#67) | [LOC-64](https://linear.app/loceract/issue/LOC-64) | 🟦 In progress ([PR6.md](PR6.md)) |
 
 Closing retrospective: [RETRO.md](RETRO.md) · PR3 ([LOC-61](https://linear.app/loceract/issue/LOC-61)) ❌ confirmed dropped (RETRO §5) · Remaining (optional): on-device bench / encode helper dedup; see the project handoff notes.