Implement a novel consistent hashing algorithm with replication#80
Conversation
Pull Request Overview
This PR implements a novel consistent hashing algorithm with support for replication that achieves constant time complexity independent of the number of nodes. The implementation provides consistent hash iterators and a "choose-k" variant for selecting multiple distinct nodes.
- Introduces core consistent hashing iterators (`ConsistentHashIterator`, `ConsistentHashRevIterator`) with bucket-based enumeration
- Implements a consistent choose-k algorithm for selecting multiple replicas with guaranteed uniformity
- Adds comprehensive test coverage for uniformity and consistency properties
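For background on the constant-time claim: the best-known compact construction in this space is jump consistent hash (Lamping & Veach), which maps a key to one of `n` buckets with O(log n) expected cost and minimal reassignment when `n` grows. A minimal sketch of that classic algorithm, for comparison only (it is not necessarily the algorithm in this PR):

```rust
/// Jump consistent hash (Lamping & Veach): maps `key` to a bucket in
/// `0..num_buckets`. When the bucket count grows from n to n+1, each key
/// either keeps its bucket or moves to the new bucket n.
fn jump_hash(mut key: u64, num_buckets: u32) -> u32 {
    let (mut b, mut j): (i64, i64) = (-1, 0);
    while j < num_buckets as i64 {
        b = j;
        // Linear congruential step from the original paper.
        key = key.wrapping_mul(2862933555777941757).wrapping_add(1);
        // Jump ahead to the next bucket count at which this key would move.
        j = ((b + 1) as f64 * ((1u64 << 31) as f64 / ((key >> 33) + 1) as f64)) as i64;
    }
    b as u32
}
```

Note that jump hash is O(log n) expected, not O(1), so the PR's claim of complexity independent of the number of nodes goes beyond it.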
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| crates/consistent-hashing/src/lib.rs | Core implementation of the consistent hashing algorithm with iterators and choose-k functionality |
| crates/consistent-hashing/README.md | Documentation explaining the algorithm, complexity analysis, and replication concepts |
| crates/consistent-hashing/Cargo.toml | Package configuration for the new consistent hashing crate |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Luke Francl <look@github.com>
itsibitzi left a comment
An initial batch of comments.
crates/consistent-hashing/src/lib.rs
Outdated
```rust
let mut samples = Vec::with_capacity(self.k);
let mut samplers: Vec<_> = (0..self.k)
    .map(|i| ConsistentHashRevIterator::new(n - i, self.builder.seq_builder(i)).peekable())
```
Mentioned in Slack, but I'll restate here.
It might be nice for performance to make `k` a const generic so these allocations can live on the stack. There may be other ways to avoid the allocation, of course.
I did some quick benchmarking: removing the Vec allocations lowers the runtime by ~35% with k=3 and ~25% with k=5.
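The suggestion can be sketched as follows. This is a hedged illustration, not the PR's code: `choose_k` is a hypothetical name, and the splitmix64-style mix is a stand-in for the crate's actual sampler logic. The point is only the allocation pattern, where a const generic `K` lets per-call state live in a fixed-size stack array built with `std::array::from_fn`:

```rust
/// Hypothetical sketch: with `K` as a const generic, the per-replica state
/// is a stack array instead of a heap-allocated Vec.
fn choose_k<const K: usize>(seed: u64) -> [u64; K] {
    std::array::from_fn(|i| {
        // Stand-in "sampler": a splitmix64-style bit mix per replica index.
        // (Placeholder for the real sampling logic, which lives in the crate.)
        let mut x = seed ^ (i as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        x ^= x >> 30;
        x = x.wrapping_mul(0xBF58_476D_1CE4_E5B9);
        x ^= x >> 27;
        x
    })
}
```

The trade-off is that `K` must be known at compile time, so a Vec-taking entry point is still useful for callers with a dynamic replica count.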
Added another function which takes a vector as input.
Changed the implementation to get rid of the second vector.
The downside is that we now have to do more work to fetch the next consistent hash for a specific sampler 🤷
Generally, performance is actually not so bad...
This algorithm is special, since its complexity does not depend on the number of nodes!
The simple implementation is quadratic in the number of desired replication nodes. An r * log(r) complexity could be achieved by using a heap or sorted-set data structure, but since r is typically small, that overhead is not worth it.
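The r * log(r) variant could be sketched with a binary heap. In this hedged illustration each sampler is simplified to a pre-sorted list of candidate hashes (hypothetical data, not the crate's iterator API); the heap yields the overall next candidate in O(log r) per step instead of scanning all r samplers:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Pick the `r` smallest candidates across several sorted samplers.
/// Heap entries are (candidate, sampler index, position within sampler),
/// wrapped in `Reverse` to turn the max-heap into a min-heap.
fn top_r(samplers: &[Vec<u64>], r: usize) -> Vec<u64> {
    let mut heap: BinaryHeap<Reverse<(u64, usize, usize)>> = samplers
        .iter()
        .enumerate()
        .filter_map(|(i, s)| s.first().map(|&c| Reverse((c, i, 0))))
        .collect();
    let mut out = Vec::with_capacity(r);
    while out.len() < r {
        // Pop the globally smallest candidate: O(log r) per step.
        let Some(Reverse((c, i, p))) = heap.pop() else { break };
        out.push(c);
        // Refill the heap from the sampler we just consumed from.
        if let Some(&next) = samplers[i].get(p + 1) {
            heap.push(Reverse((next, i, p + 1)));
        }
    }
    out
}
```

For small r (say, r = 3) the heap bookkeeping typically costs more than the quadratic scan it replaces, which matches the conclusion above.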