feat: pre-hashed Add/Has for cryptographic digests #32
Open
Skip SipHash when the input is already a crypto digest (SHA2-256, etc.) by accepting pre-split h1/h2 values directly (Kirsch-Mitzenmacher optimization, https://doi.org/10.1002/rsa.20208).
- SplitDigest: extracts h1/h2 from digest bytes
- AddHashed, HasHashed, AddIfNotHasHashed + TS variants
- h2 forced odd internally for coprimality with 2^sizeExp
2M SHA2-256 CIDs, per-op and bulk wall-clock comparisons. Also validates FP rates are equivalent across both paths.
- remove BenchmarkSummary2M that mixed benchmarks with FP assertions
- use BenchmarkPerOp_ and BenchmarkBulk2M_ naming convention
- drop unused fmt import
- show SplitDigest + AddHashed usage in README example
- add AddHashed/HasHashed rows to benchmark table (~4-6 ns/op)
Track which hashing path (Add vs AddHashed) a filter uses and panic if the other path is called on the same filter. The mode is persisted in JSON so it survives serialization. Legacy JSON without a HashMode field defaults to SipHash for non-empty filters. Clear() resets the mode so the filter can be reused with either path.
SplitDigest is now an internal detail. AddHashed, HasHashed, and AddIfNotHasHashed accept the digest []byte and extract h1/h2 internally, mirroring the Add(entry []byte) signature.
When the input key is already a cryptographic hash digest (SHA2-256, BLAKE2b-256, etc.), the SipHash computation is redundant. This PR adds a fast path that reads two hash values directly from the digest bytes, skipping SipHash entirely.
Based on Kirsch & Mitzenmacher, "Less Hashing, Same Performance: Building a Better Bloom Filter" (2008), which proves that two hash values are sufficient to simulate k independent hash functions with no increase in the asymptotic false positive rate: https://doi.org/10.1002/rsa.20208
(IIUC, this means we can use digests from Multihashes directly and avoid an extra hash round via SipHash, as long as we are careful not to pass identity hashes to it.)
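For readers unfamiliar with the Kirsch-Mitzenmacher construction, here is a minimal sketch of how k bit positions can be derived from two hash values as g_i = h1 + i*h2 (mod 2^sizeExp). The function name `positions` and its parameters are hypothetical and are not taken from this PR; the sketch only illustrates the technique, including forcing h2 odd so that it is coprime with the power-of-two table size.

```go
package main

import "fmt"

// positions is a hypothetical helper (not this PR's code). It derives k bit
// indexes from two 64-bit hash values via g_i = h1 + i*h2 (mod 2^sizeExp).
// sizeExp is assumed to be smaller than 64.
func positions(h1, h2 uint64, k int, sizeExp uint) []uint64 {
	// An odd h2 is coprime with 2^sizeExp, so successive probes walk the
	// whole table before repeating instead of cycling through a subset.
	h2 |= 1
	mask := (uint64(1) << sizeExp) - 1
	out := make([]uint64, k)
	for i := range out {
		out[i] = (h1 + uint64(i)*h2) & mask
	}
	return out
}

func main() {
	// h2 = 6 becomes 7; with a 16-slot table the probes are 10, 17, 24, 31 mod 16.
	fmt.Println(positions(10, 6, 4, 4)) // prints [10 1 8 15]
}
```

With an odd step and a power-of-two size, the first 2^sizeExp probes are all distinct, which is the property the "h2 forced odd" commit relies on.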
New public API
AddHashed(digest []byte) / HasHashed / AddIfNotHasHashed + TS variants
The digest must be at least 16 bytes (true for SHA2-256, BLAKE2b-256, BLAKE3, etc.).
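The 16-byte minimum follows from needing two 64-bit values. A minimal sketch of that extraction is below; `splitDigest` is a hypothetical stand-in for the internal helper, and the choice of the first 16 bytes and little-endian byte order are assumptions for illustration, not details confirmed by this PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

var errShortDigest = errors.New("digest shorter than 16 bytes")

// splitDigest is a hypothetical helper: it reads h1 and h2 from the first
// 16 bytes of a digest. Byte order and offsets are assumptions.
func splitDigest(digest []byte) (h1, h2 uint64, err error) {
	if len(digest) < 16 {
		return 0, 0, errShortDigest
	}
	h1 = binary.LittleEndian.Uint64(digest[0:8])
	// Force h2 odd so it is coprime with a power-of-two filter size.
	h2 = binary.LittleEndian.Uint64(digest[8:16]) | 1
	return h1, h2, nil
}

func main() {
	sum := sha256.Sum256([]byte("example block"))
	h1, h2, err := splitDigest(sum[:])
	fmt.Println(h1, h2, err)
}
```

Returning an error for short input is one possible design; the PR text does not say whether the real methods panic, return an error, or ignore short digests.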
Usage with go-cid
Digests from different hash functions (SHA2-256, BLAKE2b-256, SHA3, etc.) can be mixed freely in the same filter -- each produces uniform, independent bytes. Identity multihashes and digests shorter than 16 bytes cannot use this path; use a separate filter with Add for those.
Safety
Add (SipHash) and AddHashed (pre-hashed) compute different bit positions for the same logical key. The filter tracks which path was used first and panics if the other is called. The mode is persisted in JSON so it survives serialization. Legacy JSON without a HashMode field defaults to SipHash for non-empty filters. Clear() resets the mode.
Benchmarks (2M SHA2-256 CIDs)
The mode-tracking check adds zero measurable overhead (single always-predicted branch).
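To make the mode-tracking behavior concrete, here is a minimal sketch of a guard with a single-comparison hot path. The type and constant names (`filter`, `hashMode`, `modeSipHash`, `modeHashed`) are hypothetical; the real filter's fields, bit setting, and JSON handling are omitted and may differ.

```go
package main

import "fmt"

type hashMode uint8

const (
	modeUnset hashMode = iota
	modeSipHash
	modeHashed
)

// filter is a hypothetical stand-in that only models the mode guard.
type filter struct {
	mode hashMode
}

// checkMode records the first hashing path used and panics on a mismatch.
func (f *filter) checkMode(want hashMode) {
	if f.mode == want {
		return // hot path: one well-predicted comparison
	}
	if f.mode != modeUnset {
		panic("bloom: Add and AddHashed mixed on the same filter")
	}
	f.mode = want
}

func (f *filter) Add(entry []byte)        { f.checkMode(modeSipHash) }
func (f *filter) AddHashed(digest []byte) { f.checkMode(modeHashed) }
func (f *filter) Clear()                  { f.mode = modeUnset }

// panics reports whether fn panicked.
func panics(fn func()) (p bool) {
	defer func() { p = recover() != nil }()
	fn()
	return false
}

func main() {
	f := &filter{}
	f.Add([]byte("key"))
	fmt.Println(panics(func() { f.AddHashed(make([]byte, 32)) })) // prints true
	f.Clear()
	fmt.Println(panics(func() { f.AddHashed(make([]byte, 32)) })) // prints false
}
```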