feat: pre-hashed Add/Has for cryptographic digests by lidel · Pull Request #32 · ipfs/bbloom

lidel · 2026-03-17T18:04:32Z

When the input key is already a cryptographic hash digest (SHA2-256, BLAKE2b-256, etc.), the SipHash computation is redundant. This PR adds a fast path that reads two hash values directly from the digest bytes, skipping SipHash entirely.

Based on Kirsch & Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter" (2008), which proves that two hash values are sufficient to simulate k independent hash functions with no increase in the asymptotic false positive rate: https://doi.org/10.1002/rsa.20208
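For readers unfamiliar with the construction, here is a minimal sketch of the double-hashing scheme the paper analyzes, where the i-th index is `g_i = h1 + i*h2 (mod m)`. The function name `positions` and its parameters are illustrative assumptions, not code from this PR.

```go
package main

import "fmt"

// positions derives k bit indexes from two base hash values using
// g_i = h1 + i*h2 (mod m), where m is the number of bits in the filter.
// Illustrative only: this is not the bbloom implementation.
func positions(h1, h2 uint64, k int, m uint64) []uint64 {
	out := make([]uint64, k)
	for i := 0; i < k; i++ {
		// uint64 arithmetic wraps modulo 2^64 before the reduction by m,
		// which is harmless for hashing purposes.
		out[i] = (h1 + uint64(i)*h2) % m
	}
	return out
}

func main() {
	fmt.Println(positions(10, 3, 4, 16)) // [10 13 0 3]
}
```

Only `h1` and `h2` have to come from a hash function; every further index costs one multiply and one add.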

(IIUC, this means we can use digests from Multihashes directly and avoid the extra hashing round via SipHash, as long as we are careful not to pass identity hashes to it.)

New public API

AddHashed(digest []byte) / HasHashed / AddIfNotHasHashed + TS variants

The digest must be at least 16 bytes (true for SHA2-256, BLAKE2b-256, BLAKE3, etc.).
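The 16-byte minimum follows from needing two 64-bit values. A rough sketch of how such a split could look is below; `splitDigest` is a hypothetical helper, and both the little-endian byte order and the choice of the first 16 bytes are assumptions, since the description does not say how the PR reads the digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

var errShortDigest = errors.New("digest shorter than 16 bytes")

// splitDigest reads two 64-bit values from the first 16 bytes of a digest.
// Hypothetical helper: byte order and offsets are assumptions, not the
// layout used by the PR.
func splitDigest(digest []byte) (h1, h2 uint64, err error) {
	if len(digest) < 16 {
		return 0, 0, errShortDigest
	}
	h1 = binary.LittleEndian.Uint64(digest[0:8])
	h2 = binary.LittleEndian.Uint64(digest[8:16])
	return h1, h2, nil
}

func main() {
	sum := sha256.Sum256([]byte("hello"))
	h1, h2, err := splitDigest(sum[:])
	fmt.Printf("h1=%016x h2=%016x err=%v\n", h1, h2, err)
}
```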

Usage with go-cid

dm, _ := multihash.Decode(c.Hash())
bf.AddHashedTS(dm.Digest)

Digests from different hash functions (SHA2-256, BLAKE2b-256, SHA3, etc.) can be mixed freely in the same filter -- each produces uniform, independent bytes. Identity multihashes and digests shorter than 16 bytes cannot use this path; use a separate filter with Add for those.
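A caller-side guard for the two excluded cases might look like the sketch below. `canUsePreHashed` is a hypothetical helper, not part of the PR. With go-multihash, the `code` and `digest` arguments would come from the `Code` and `Digest` fields of the decoded multihash; the sketch takes them as plain values so it stays stdlib-only.

```go
package main

import "fmt"

// identityCode is the multicodec code of the identity "hash", which
// carries the raw input bytes rather than a uniform digest.
const identityCode = 0x00

// canUsePreHashed reports whether a digest qualifies for the pre-hashed
// path: it must not be an identity multihash and must be at least 16 bytes.
// Hypothetical helper for illustration.
func canUsePreHashed(code uint64, digest []byte) bool {
	return code != identityCode && len(digest) >= 16
}

func main() {
	const sha2_256 = 0x12 // multicodec code for sha2-256
	fmt.Println(canUsePreHashed(sha2_256, make([]byte, 32)))     // true
	fmt.Println(canUsePreHashed(identityCode, make([]byte, 32))) // false
	fmt.Println(canUsePreHashed(sha2_256, make([]byte, 8)))      // false
}
```

Keys that fail the check would go to a separate filter via Add, as described above.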

Safety

Add (SipHash) and AddHashed (pre-hashed) compute different bit positions for the same logical key. The filter tracks which path was used first and panics if the other is called. The mode is persisted in JSON so it survives serialization. Legacy JSON without a HashMode field defaults to SipHash for non-empty filters. Clear() resets the mode.
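The behavior described above can be sketched with a toy filter. Everything here is an assumption made for illustration: the type names, the `Count` field standing in for the real bitset, and the JSON layout other than the `HashMode` field name, which is the only one the description mentions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type hashMode uint8

const (
	modeUnset     hashMode = iota // nothing added yet
	modeSipHash                   // Add / Has path
	modePreHashed                 // AddHashed / HasHashed path
)

// filter is a toy stand-in: count replaces the real bitset.
type filter struct {
	mode  hashMode
	count int
}

// wire is a hypothetical JSON layout; legacy documents lack HashMode.
type wire struct {
	Count    int      `json:"Count"`
	HashMode hashMode `json:"HashMode,omitempty"`
}

// use records the first path taken and panics if the other one shows up later.
func (f *filter) use(m hashMode) {
	if f.mode == modeUnset {
		f.mode = m
	} else if f.mode != m {
		panic("bloom: Add and AddHashed cannot be mixed on the same filter")
	}
}

func (f *filter) Add(entry []byte)        { f.use(modeSipHash); f.count++ }
func (f *filter) AddHashed(digest []byte) { f.use(modePreHashed); f.count++ }
func (f *filter) Clear()                  { *f = filter{} } // also resets the mode

func (f *filter) toJSON() ([]byte, error) {
	return json.Marshal(wire{Count: f.count, HashMode: f.mode})
}

func fromJSON(b []byte) (*filter, error) {
	var w wire
	if err := json.Unmarshal(b, &w); err != nil {
		return nil, err
	}
	f := &filter{mode: w.HashMode, count: w.Count}
	// A non-empty filter without a recorded mode predates the field,
	// so it can only have been filled through the SipHash path.
	if f.mode == modeUnset && f.count > 0 {
		f.mode = modeSipHash
	}
	return f, nil
}

// panics reports whether fn panicked.
func panics(fn func()) (p bool) {
	defer func() { p = recover() != nil }()
	fn()
	return false
}

func main() {
	f := &filter{}
	f.Add([]byte("key"))
	fmt.Println(panics(func() { f.AddHashed(make([]byte, 32)) })) // true
	f.Clear()
	fmt.Println(panics(func() { f.AddHashed(make([]byte, 32)) })) // false
	legacy, _ := fromJSON([]byte(`{"Count":3}`))
	fmt.Println(legacy.mode == modeSipHash) // true
}
```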

Benchmarks (2M SHA2-256 CIDs)

                     SipHash     Hashed     Speedup
Per-op:
  Add               ~19 ns      ~5 ns       3.5x
  Has               ~19 ns      ~5 ns       4.0x

Bulk (2M CIDs):
  Add 2M             67 ms       29 ms      2.3x
  Has 2M             70 ms       28 ms      2.5x
  AddIfNotHas 2M     75 ms       30 ms      2.5x

FP rate (100k non-member probes):
  SipHash=0.046%  Hashed=0.045%  (target: 1%)

The mode-tracking check adds zero measurable overhead (single always-predicted branch).

Skip SipHash when the input is already a crypto digest (SHA2-256, etc.) by accepting pre-split h1/h2 values directly (Kirsch-Mitzenmacher optimization, https://doi.org/10.1002/rsa.20208).

- SplitDigest: extracts h1/h2 from digest bytes
- AddHashed, HasHashed, AddIfNotHasHashed + TS variants
- h2 forced odd internally for coprimality with 2^sizeExp
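The reason for forcing h2 odd can be checked with a small experiment: when the table size is 2^sizeExp, an odd stride is coprime with it and the sequence h1 + i*h2 reaches every position, while an even stride cycles through only a fraction of them. The helper `distinctPositions` below is an illustrative assumption and not part of the PR.

```go
package main

import "fmt"

// distinctPositions counts how many different indexes
// h1 + i*h2 (mod 2^sizeExp) reaches over 2^sizeExp probes.
func distinctPositions(h1, h2 uint64, sizeExp uint) int {
	m := uint64(1) << sizeExp
	mask := m - 1 // reduction mod 2^sizeExp is a bit mask
	seen := make(map[uint64]struct{}, m)
	for i := uint64(0); i < m; i++ {
		seen[(h1+i*h2)&mask] = struct{}{}
	}
	return len(seen)
}

func main() {
	const sizeExp = 4 // 16 positions
	fmt.Println(distinctPositions(5, 6, sizeExp))   // even stride: 8
	fmt.Println(distinctPositions(5, 6|1, sizeExp)) // forced odd: 16
}
```

Setting the low bit (`h2 | 1`) is enough to guarantee an odd stride, at the cost of one bit of h2.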

Track which hashing path (Add vs AddHashed) a filter uses and panic if the other path is called on the same filter. The mode is persisted in JSON so it survives serialization. Legacy JSON without a HashMode field defaults to SipHash for non-empty filters. Clear() resets the mode so the filter can be reused with either path.

lidel added 8 commits March 17, 2026 18:08

test: add benchmarks comparing SipHash vs pre-hashed path

59585e9

2M SHA2-256 CIDs, per-op and bulk wall-clock comparisons. Also validates FP rates are equivalent across both paths.

test: clean up pre-hashed benchmarks

c38da20

- remove BenchmarkSummary2M that mixed benchmarks with FP assertions
- use BenchmarkPerOp_ and BenchmarkBulk2M_ naming convention
- drop unused fmt import

docs: add pre-hashed API to README and BENCHMARKS

0c91f38

- show SplitDigest + AddHashed usage in README example
- add AddHashed/HasHashed rows to benchmark table (~4-6 ns/op)

docs: clarify no-mixing constraint on all *Hashed godoc

e4ee53c

docs: fix SplitDigest example and note mixed digests are safe

be27611

refactor: change *Hashed API to accept []byte digest directly

3ab5bfe

SplitDigest is now an internal detail. AddHashed, HasHashed, and AddIfNotHasHashed accept the digest []byte and extract h1/h2 internally, mirroring the Add(entry []byte) signature.

lidel wants to merge 8 commits into master from feat/kirsch-mitzenmacher-hashed
