diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md
index 764df99..2d3fcb6 100644
--- a/proposed/0033-block-turboquant.md
+++ b/proposed/0033-block-turboquant.md
@@ -76,6 +76,58 @@
 simple deployment, and theoretical guarantees matter most, while PQ or OPQ may
 still win empirically when a learned vector codebook can exploit dataset-specific
 structure.
+
+### Comparison to HIGGS
+
+HIGGS [12] (Malinovskii et al., 2024) is a data-free quantization method for LLM
+weight matrices that shares TurboQuant's core idea — Hadamard rotation followed by
+MSE-optimal grid quantization — but targets a different application domain and makes
+different design trade-offs:
+
+|                      | TurboQuant                                                                | HIGGS                                                                      |
+| -------------------- | ------------------------------------------------------------------------- | -------------------------------------------------------------------------- |
+| Application domain   | ANN embedding search (per-vector, online)                                 | LLM weight quantization (per-layer, offline)                               |
+| Rotation             | 3-round SORF (HD₃·HD₂·HD₁): high-quality random orthogonal approximation | Single RHT (H·D): one Hadamard × random diagonal signs                     |
+| Target distribution  | Beta marginal (1-x²)^((d-3)/2) on unit sphere                             | Approximate Gaussian N(0,1)                                                |
+| Quantization grid    | Max-Lloyd centroids (scalar, p=1), analytically derived for Beta          | CLVQ grids (Pagès & Printems 2003), supports vector quantization p∈{1,2,4} |
+| Error metric         | Pure MSE (reconstruction error)                                           | MSE + Hessian-weighted per-layer coefficients αₗ (Linearity Theorem)       |
+| Calibration data     | None                                                                      | None for quantization; small calibration set for αₗ estimation             |
+| Non-uniform bitwidth | No (uniform across all vectors)                                           | Yes (DP solver for per-layer bit allocation)                               |
+| Distance computation | Quantized-domain scan kernel (PDX layout, SIMD over 64 vectors)           | GPU matrix multiply (FLUTE kernel)                                         |
+| Norm storage         | Explicit per-block norms for distance computation                         | Per-group scales folded into weight reconstruction                         |
+
+**Key design differences explained:**
+
+- **Rotation depth.** TurboQuant normalizes to the unit sphere first, so
+  coordinates must follow the specific Beta marginal for Max-Lloyd centroids to
+  be optimal — this requires a high-quality random orthogonal approximation
+  (3-round SORF). HIGGS operates on raw (group-normalized) weights and only
+  needs approximate Gaussianity, so a single RHT suffices.
+- **VQ dimension.** HIGGS's CLVQ grids support multi-dimensional vector
+  quantization (p>1), where groups of p coordinates are quantized jointly to an
+  optimal multi-dimensional grid. At 3-4 bits, p=2 or p=4 achieves better
+  rate-distortion than scalar (p=1) quantization by exploiting residual
+  correlations between coordinates. TurboQuant is currently scalar-only (p=1);
+  p>1 would require changes to the PDX scan kernel (per-subvector codebook
+  lookup instead of per-coordinate). See Future work for discussion.
+- **Error metric.** HIGGS's Linearity Theorem (perplexity increase ≈ Σ αₗ·tₗ²)
+  enables Hessian-aware optimization specific to LLM inference. For ANN search,
+  MSE is the natural metric — it directly bounds distance distortion — and
+  non-uniform bit allocation has no analogue (all vectors share the same
+  encoding).
+- **Beta vs. Gaussian at high d.** As d grows, the Beta distribution
+  (1-x²)^((d-3)/2) concentrates and becomes approximately Gaussian with
+  variance ~1/d. At d=256+, the practical difference between Beta-optimal and
+  Gaussian-optimal grids shrinks. Whether Gaussian grids (simpler: one grid per
+  bitwidth, no dimension dependence) match Beta Max-Lloyd for ANN recall is an
+  empirical question — see Experimental plan.
+
+**Domain mismatch.** Comparisons of TurboQuant vs. HIGGS on LLM perplexity
+benchmarks are misleading: HIGGS's Hessian-aware optimization naturally dominates
+for that task, but TurboQuant was never designed for LLM weight quantization.
+The relevant comparison is ANN recall@k on embedding datasets, where TurboQuant's
+block decomposition, PDX scan layout, and per-vector encode/decode are the
+critical features.
+
 ### Current Vortex implementation
 
 The [current implementation][current-impl] (Rust, in the `vortex-tensor` crate,
@@ -908,6 +960,31 @@
 maximizes per-block quality and minimizes norm count. Experiments may show that
 smaller B with more pruning checkpoints yields better end-to-end scan
 performance despite higher per-block overhead.
+
+### Gaussian-optimal vs. Beta-optimal grids
+
+HIGGS [12] demonstrates that Gaussian-optimal grids (computed via CLVQ for N(0,1))
+work well after a single Hadamard rotation. Since the Beta marginal converges to
+Gaussian at high d, test whether Gaussian grids can replace Beta Max-Lloyd centroids
+for ANN search:
+
+- **Grid comparison**: At B ∈ {64, 128, 256, 512} and b ∈ {2, 3, 4, 5, 8},
+  compare ANN recall@k and normalized MSE for (a) Beta Max-Lloyd centroids at
+  B-dim, (b) Gaussian-optimal scalar grids (Normal Float style), and
+  (c) CLVQ-computed Gaussian grids. Report the crossover point where the grids
+  become practically equivalent.
+- **Rotation depth**: If Gaussian grids match Beta Max-Lloyd at a given B, test
+  whether 1-round RHT (H·D with random signs) achieves comparable quality to
+  3-round SORF. A single round would reduce rotation cost by ~3× and simplify
+  the transform. Test at B ∈ {64, 128, 256, 512} on the benchmarking datasets.
+- **Simplification potential**: If Gaussian grids + 1-round RHT match quality at
+  B ≥ 256, this eliminates the dimension-dependent centroid computation (one grid
+  per bitwidth, shared across all block sizes) and reduces rotation overhead.
+  This would be a significant implementation simplification for Stage 2+.
+
+The expectation is that at B=256+ the difference is negligible, but at B=64-128
+the Beta-optimal grids may still win due to stronger non-Gaussian effects.
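As a quick, self-contained sanity check of the Beta-to-Gaussian convergence claim, the sketch below fits 3-bit (k=8) scalar Lloyd grids to samples from N(0,1) and to rescaled unit-sphere coordinates at d=256, then measures the MSE penalty of applying the Gaussian-fit grid to the sphere data. This is stdlib-only Python, not part of the Vortex codebase; the sample size, seed, and iteration count are arbitrary choices for illustration.

```python
import math
import random

random.seed(42)

def sphere_first_coord(d, n):
    # First coordinate of n uniform random points on the unit sphere S^(d-1),
    # sampled by normalizing standard Gaussian vectors.
    out = []
    for _ in range(n):
        v = [random.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        out.append(v[0] / norm)
    return out

def lloyd_1d(samples, k, iters=30):
    # Plain 1-D Lloyd iteration (k-means) approximating the MSE-optimal
    # scalar grid for the empirical distribution; quantile initialization.
    s = sorted(samples)
    centers = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for x in samples:
            j = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            sums[j] += x
            counts[j] += 1
        centers = [sums[j] / counts[j] if counts[j] else centers[j]
                   for j in range(k)]
    return centers

def mse(samples, centers):
    return sum(min((x - c) ** 2 for c in centers) for x in samples) / len(samples)

d, n, k = 256, 8000, 8  # block size, sample count, 2^3 levels (b = 3 bits)

# Rescale sphere coordinates by sqrt(d) so the Beta marginal has unit variance.
beta = [math.sqrt(d) * x for x in sphere_first_coord(d, n)]
gauss = [random.gauss(0.0, 1.0) for _ in range(n)]

# Kurtosis of the rescaled Beta marginal is exactly 3d/(d+2) ~ 2.977 at d=256,
# already very close to the Gaussian value of 3.
kurt = (sum(x ** 4 for x in beta) / n) / (sum(x ** 2 for x in beta) / n) ** 2

# MSE penalty for quantizing Beta-marginal data with a Gaussian-fit grid.
g_grid = lloyd_1d(gauss, k)
b_grid = lloyd_1d(beta, k)
ratio = mse(beta, g_grid) / mse(beta, b_grid)

print(f"kurtosis of rescaled sphere coordinate (d={d}): {kurt:.3f}")
print(f"MSE penalty of Gaussian grid on Beta data, k={k}: {ratio:.4f}")
```

At smaller block sizes the kurtosis gap grows (3d/(d+2) ≈ 2.91 at d=64), so the same sketch run at d=64 should show a larger penalty, consistent with the expectation that Beta-optimal grids matter most at small B.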
+Results should inform whether the centroid computation strategy changes in
+Phase 2.
+
 ### QJL strategy comparison (if pursued)
 
 - Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim SORF QJL
@@ -992,6 +1069,43 @@
 In all cases, MSE-only is the recommended starting point. QJL should only be
 added if experiments demonstrate clear recall@k improvements for the target
 workload.
+
+## Future work: Multi-dimensional vector quantization (p>1)
+
+HIGGS [12] demonstrates that vector quantization with dimension p>1 (quantizing
+groups of p coordinates jointly to an optimal multi-dimensional grid) achieves
+better rate-distortion than scalar quantization (p=1) at the same bit budget. For
+TurboQuant, this would mean replacing the per-coordinate Max-Lloyd centroid lookup
+with a per-subvector codebook lookup, where each group of p rotated coordinates
+maps to one of n codewords in a p-dimensional CLVQ grid.
+
+**Benefits:**
+
+- Improved rate-distortion: at 3-4 bits, p=2 or p=4 captures residual
+  correlations between coordinates that scalar quantization misses.
+- Simpler centroid computation: CLVQ grids for Gaussian inputs are computed once
+  per (n, p) pair and reused across all block sizes (no dimension dependence).
+
+**Costs and constraints:**
+
+- **Distance kernel redesign.** The PDX scan kernel (Stage 3) is built around
+  per-coordinate centroid lookups with a (2^b)²-entry distance table. At p=2
+  with b=4 bits per coordinate, the codebook has 2^(4×2)=256 entries, and the
+  distance table becomes 256×256=64K entries (256 KB) — too large for L1 but
+  still within L2, versus the current 1 KB table at b=4 scalar. At p=4 the
+  table is infeasible; alternative distance strategies (asymmetric distance
+  computation, partial codebook scans) would be needed.
+- **GPU shared memory.** HIGGS notes that the total grid size 2^(b×p) must fit
+  in GPU shared memory (~2^10 points as a practical limit), constraining
+  (b, p) pairs.
+- **PDX layout interaction.** The current "1 dim × 64 vecs" PDX layout assumes
+  per-coordinate independence. At p>1, the layout would need to group p
+  consecutive dimensions together per lookup, changing the transpose structure.
+
+**Recommendation:** Evaluate p=2 VQ experimentally after Stage 3 (PDX) is
+validated. Compare ANN recall@k at matched bit budgets: p=1 vs. p=2, both at b
+bits per coordinate (so a p=2 codeword carries 2b bits). If p=2 shows a
+meaningful recall improvement (>2 percentage points in recall@10), design the
+kernel changes as a Stage 4 extension. Grids for p=2 can be precomputed offline
+with the CLVQ algorithm of Pagès & Printems (2003), as applied in HIGGS [12].
+
 ## Future work: GPU decode and fused distance computation
 
 The B-dim block structure maps naturally to GPU tile sizes and tensor cores.
@@ -1181,6 +1295,10 @@
 IEEE Trans. PAMI 36(4):744-755, 2014.
 
 [11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M.
 "VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025.
+
+[12] Malinovskii, V., Panferov, A., Ilin, I., Guo, H., Richtárik, P. and
+Alistarh, D. "Pushing the Limits of Large Language Model Quantization via the
+Linearity Theorem." arXiv:2411.17525, November 2024.
+
 ## Appendix A: Reference implementation bugs and Theorem 1 constant
 
 ### Reference implementation bugs