feat: pre-hashed Add/Has for cryptographic digests #32

Open
lidel wants to merge 8 commits into master from feat/kirsch-mitzenmacher-hashed

lidel commented Mar 17, 2026

When the input key is already a cryptographic hash digest (SHA2-256, BLAKE2b-256, etc.), the SipHash computation is redundant. This PR adds a fast path that reads two hash values directly from the digest bytes, skipping SipHash entirely.

Based on Kirsch & Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter" (2008), which proves that two hash values are sufficient to simulate k independent hash functions with no increase in the asymptotic false positive rate: https://doi.org/10.1002/rsa.20208

(IIUC, this means we can use digests from Multihashes directly and avoid the extra hash round via SipHash, as long as we are careful not to pass identity hashes to it.)

New public API

  • AddHashed(digest []byte) / HasHashed / AddIfNotHasHashed + TS variants

The digest must be at least 16 bytes (true for SHA2-256, BLAKE2b-256, BLAKE3, etc.).
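To make the 16-byte requirement concrete, here is a minimal sketch of the Kirsch-Mitzenmacher scheme applied to a digest. It is not the code in this PR: the names `splitDigest` and `positions`, the little-endian byte order, and the exact index formula `(h1 + i*h2) mod 2^sizeExp` are assumptions for illustration. Only the ideas (two 64-bit values read from the digest, h2 forced odd) come from the description and commit messages.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitDigest reads two 64-bit values from the first 16 bytes of a digest.
// Hypothetical helper; little-endian order is an assumption of this sketch.
func splitDigest(digest []byte) (h1, h2 uint64, ok bool) {
	if len(digest) < 16 {
		return 0, 0, false // too short to supply two 64-bit values
	}
	h1 = binary.LittleEndian.Uint64(digest[0:8])
	h2 = binary.LittleEndian.Uint64(digest[8:16])
	// An odd h2 is coprime with 2^sizeExp, so the k probes below are distinct
	// whenever k <= 2^sizeExp.
	return h1, h2 | 1, true
}

// positions derives k bit indexes g_i = (h1 + i*h2) mod 2^sizeExp.
func positions(h1, h2 uint64, k int, sizeExp uint) []uint64 {
	mask := uint64(1)<<sizeExp - 1
	out := make([]uint64, k)
	for i := range out {
		out[i] = (h1 + uint64(i)*h2) & mask
	}
	return out
}

func main() {
	digest := make([]byte, 32) // stand-in for a SHA2-256 digest
	for i := range digest {
		digest[i] = byte(i * 7)
	}
	h1, h2, ok := splitDigest(digest)
	fmt.Println(ok, positions(h1, h2, 7, 20))
}
```

Only the first 16 bytes are consumed in this sketch, which is why any longer cryptographic digest would work equally well.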

Usage with go-cid

dm, _ := multihash.Decode(c.Hash())
bf.AddHashedTS(dm.Digest)

Digests from different hash functions (SHA2-256, BLAKE2b-256, SHA3, etc.) can be mixed freely in the same filter -- each produces uniform, independent bytes. Identity multihashes and digests shorter than 16 bytes cannot use this path; use a separate filter with Add for those.
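A caller could guard the fast path with a small check like the one below. This is a sketch only: `canUsePrehashed` is a hypothetical helper, not part of this PR or of go-multihash. The constants are the multicodec codes for identity (0x00) and sha2-256 (0x12).

```go
package main

import "fmt"

const (
	identityCode = 0x00 // multicodec identity: the "digest" is the raw input
	sha256Code   = 0x12 // multicodec sha2-256
	minDigestLen = 16   // minimum digest length stated in this PR
)

// canUsePrehashed reports whether a decoded multihash may take the
// pre-hashed path. Hypothetical helper for illustration.
func canUsePrehashed(code uint64, digest []byte) bool {
	return code != identityCode && len(digest) >= minDigestLen
}

func main() {
	fmt.Println(canUsePrehashed(sha256Code, make([]byte, 32)))
	// An identity multihash is rejected even when it is long enough,
	// because its bytes are caller-controlled rather than uniform.
	fmt.Println(canUsePrehashed(identityCode, []byte("inline data longer than 16 bytes")))
}
```

Keys that fail the check would go to the separate SipHash-based filter via Add, as described above.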

Safety

Add (SipHash) and AddHashed (pre-hashed) compute different bit positions for the same logical key. The filter tracks which path was used first and panics if the other is called. The mode is persisted in JSON so it survives serialization. Legacy JSON without a HashMode field defaults to SipHash for non-empty filters. Clear() resets the mode.

Benchmarks (2M SHA2-256 CIDs)

                     SipHash     Hashed     Speedup
Per-op:
  Add               ~19 ns      ~5 ns       3.5x
  Has               ~19 ns      ~5 ns       4.0x

Bulk (2M CIDs):
  Add 2M             67 ms       29 ms      2.3x
  Has 2M             70 ms       28 ms      2.5x
  AddIfNotHas 2M     75 ms       30 ms      2.5x

FP rate (100k non-member probes):
  SipHash=0.046%  Hashed=0.045%  (target: 1%)

The mode-tracking check adds zero measurable overhead (single always-predicted branch).

lidel added 8 commits March 17, 2026 18:08
Skip SipHash when the input is already a crypto digest (SHA2-256, etc.)
by accepting pre-split h1/h2 values directly (Kirsch-Mitzenmacher
optimization, https://doi.org/10.1002/rsa.20208).

- SplitDigest: extracts h1/h2 from digest bytes
- AddHashed, HasHashed, AddIfNotHasHashed + TS variants
- h2 forced odd internally for coprimality with 2^sizeExp

2M SHA2-256 CIDs, per-op and bulk wall-clock comparisons.
Also validates FP rates are equivalent across both paths.

- remove BenchmarkSummary2M that mixed benchmarks with FP assertions
- use BenchmarkPerOp_ and BenchmarkBulk2M_ naming convention
- drop unused fmt import

- show SplitDigest + AddHashed usage in README example
- add AddHashed/HasHashed rows to benchmark table (~4-6 ns/op)

Track which hashing path (Add vs AddHashed) a filter uses and panic
if the other path is called on the same filter. The mode is persisted
in JSON so it survives serialization. Legacy JSON without a HashMode
field defaults to SipHash for non-empty filters.

Clear() resets the mode so the filter can be reused with either path.

SplitDigest is now an internal detail. AddHashed, HasHashed, and
AddIfNotHasHashed accept the digest []byte and extract h1/h2
internally, mirroring the Add(entry []byte) signature.