-
Thank you for your proposal! Here are some comments after a brief review.
I think we should also store the max number of buckets of each filter, i.e. how many buckets are stored in one filter?
Also, IMHO: before "expansion" (BTW, I prefer to use "rehashing" rather than "expansion")? Or before "giving up"?
All filters should be useful. If we only care about the last filter, I think we should delete all the other filters. Here we should locate TWO filters (maybe one, if the two buckets land in the same filter) via the item's two buckets (located by its two hash values).
-
I don't think a one-to-one mapping between filters and the rehashing count is a good idea. Even before any rehashing, we can have multiple filters. And after one rehashing, we can increase the number of filters from e.g. 2 to something like 4.
-
1. Introduction
In #2534, we plan to support the cuckoo filter.
This document outlines the design for implementing a Cuckoo Filter in kvrocks. The Cuckoo Filter is a high-performance, probabilistic data structure used for approximate set membership testing. It allows for fast checks to see if an item is in a set and, unlike Bloom Filters, supports item deletion.
A key requirement for this implementation is support for counting duplicate items (via the CF.COUNT command). To robustly handle scenarios where a single item is inserted many more times than its two candidate buckets can hold, we have chosen a sharded (or chained) expansion strategy. This approach ensures functional correctness and system stability under all conditions, trading some query performance for feature completeness and robust handling of high-frequency items.
2. Core Concepts
In a Cuckoo Filter, each item is reduced to a small fingerprint that can live in one of two candidate buckets derived from the item's hash. If both candidate buckets are full on insertion, an existing fingerprint is evicted to make room. This evicted item is then re-inserted into its alternate location, which may cause a cascade of evictions. This process continues until all items find a home or a maximum number of evictions is reached.
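A minimal, self-contained sketch of this insertion/eviction loop (in-memory only; the real implementation would read and write RocksDB buckets, and all names and constants here are illustrative):

```cpp
#include <array>
#include <cstdint>
#include <random>
#include <vector>

constexpr size_t kBucketSize = 4;   // slots per bucket (illustrative)
constexpr int kMaxIterations = 20;  // eviction limit before giving up

using Bucket = std::array<uint8_t, kBucketSize>;  // 0 marks an empty slot
using Filter = std::vector<Bucket>;               // one sub-filter

// Partial-key cuckoo hashing: alt = index XOR hash(fp). This requires a
// power-of-two bucket count so the mapping is self-inverse, letting us
// recover the alternate bucket from either index. A real implementation
// must also ensure a fingerprint is never 0, since 0 means "empty".
uint64_t AltIndex(uint64_t index, uint8_t fp, uint64_t n_buckets) {
  return (index ^ (fp * 0x5bd1e995ULL)) & (n_buckets - 1);
}

bool TryPut(Bucket &b, uint8_t fp) {
  for (auto &slot : b) {
    if (slot == 0) { slot = fp; return true; }
  }
  return false;
}

// Insert fp whose primary bucket is i1, evicting on collision up to the
// limit. On failure the caller triggers an expansion (see 4.1.1).
bool InsertWithEviction(Filter &f, uint8_t fp, uint64_t i1) {
  static std::mt19937 rng{std::random_device{}()};
  uint64_t idx = i1;
  for (int kick = 0; kick < kMaxIterations; ++kick) {
    if (TryPut(f[idx], fp) || TryPut(f[AltIndex(idx, fp, f.size())], fp)) return true;
    // Both candidate buckets are full: swap our fingerprint with a random
    // victim, then try to re-home the victim in its alternate bucket.
    std::swap(fp, f[idx][rng() % kBucketSize]);
    idx = AltIndex(idx, fp, f.size());
  }
  return false;
}
```

The XOR-based alternate index is the standard partial-key cuckoo hashing trick: it lets an evicted fingerprint be re-homed without access to the original item.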
3. Data Structure Design
To efficiently implement the Cuckoo Filter in kvrocks, we will adopt a "Bucket as Key" storage model, combined with a sharded (or chained) expansion strategy. This model maps each bucket of each sub-filter to an individual key-value pair in RocksDB, ensuring high I/O performance and cache efficiency.
3.1. Metadata
The metadata stores the overall configuration for the Cuckoo Filter chain. Each Cuckoo Filter instance corresponds to a single metadata key-value pair in RocksDB.
CuckooFilterMetadata Structure
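A plausible layout for this struct, with field names chosen purely for illustration (and including the per-filter bucket count raised in the review comments above):

```cpp
#include <cstdint>

// Illustrative layout only; the actual kvrocks encoding may differ.
struct CuckooFilterMetadata {
  uint64_t capacity;        // user-requested capacity of the first sub-filter
  uint32_t bucket_size;     // slots per bucket (default 2 in this design)
  uint32_t max_iterations;  // eviction limit before triggering expansion
  uint32_t expansion;       // growth factor applied to each new sub-filter
  uint32_t n_filters;       // number of sub-filters in the chain
  uint64_t n_buckets;       // bucket count of the newest sub-filter (power of two)
  uint64_t num_items;       // total fingerprints stored
  uint64_t num_deletes;     // total deletions
};
```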
3.2. Bucket Data
Each bucket within each sub-filter is stored as a separate key-value pair. This granular approach is the key to performance.
Key Format: cf:{namespace_key}:{filter_index}:{bucket_index}
Value Format: the raw contents of one bucket; for illustration, a fixed-size array of bucket_size fingerprints, one byte each, with 0 marking an empty slot.
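A sketch of the corresponding key and value helpers (one-byte fingerprints are an assumption, not a fixed part of the design):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Key for one bucket, following the format above.
std::string BucketKey(const std::string &ns_key, uint32_t filter_index, uint64_t bucket_index) {
  return "cf:" + ns_key + ":" + std::to_string(filter_index) + ":" + std::to_string(bucket_index);
}

// Pack a bucket into a RocksDB value: one byte per slot, 0 meaning empty.
std::string EncodeBucket(const std::vector<uint8_t> &slots) {
  return std::string(slots.begin(), slots.end());
}

std::vector<uint8_t> DecodeBucket(const std::string &value) {
  return std::vector<uint8_t>(value.begin(), value.end());
}
```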
3.3. Storage Layout in RocksDB
The following diagram illustrates the storage layout for a Cuckoo Filter that has expanded once (n_filters = 2). It details both the Metadata entry and the structure of the individual "Bucket as Key" entries.
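In place of the diagram, a textual illustration (the metadata key format cf_meta:{namespace_key} is an assumption of this sketch):

```
cf_meta:{namespace_key}   -> CuckooFilterMetadata (n_filters = 2, ...)
cf:{namespace_key}:0:0    -> bucket 0 of sub-filter 0 (frozen)
cf:{namespace_key}:0:1    -> bucket 1 of sub-filter 0
...
cf:{namespace_key}:1:0    -> bucket 0 of sub-filter 1 (the write tail)
cf:{namespace_key}:1:1    -> bucket 1 of sub-filter 1
...
```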
4. Operational Flow
4.1. Add Operation (CF.ADD)
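As a rough sketch of the control flow implied by sections 2 and 5, using the same in-memory model and hypothetical helpers (InsertWithEviction from the section 2 sketch, Expand from 4.1.1; bucket counts are assumed to be powers of two):

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Bucket = std::array<uint8_t, 4>;
using Filter = std::vector<Bucket>;

bool InsertWithEviction(Filter &f, uint8_t fp, uint64_t i1);  // see section 2
void Expand(std::vector<Filter> &filters, uint32_t expansion);  // see 4.1.1

// CF.ADD, top level: write only to the newest ("tail") sub-filter; older
// sub-filters are frozen (see section 5).
bool Add(std::vector<Filter> &filters, uint8_t fp, uint64_t hash) {
  Filter &tail = filters.back();
  if (InsertWithEviction(tail, fp, hash & (tail.size() - 1))) return true;
  // Eviction limit reached in the tail: expand the chain and retry once.
  Expand(filters, /*expansion=*/2);
  Filter &fresh = filters.back();
  return InsertWithEviction(fresh, fp, hash & (fresh.size() - 1));
}
```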
4.1.1. Internal Expansion Flow
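Sketched against the same in-memory model, assuming each new sub-filter is expansion times larger than its predecessor (kept a power of two so the XOR index trick from section 2 continues to work):

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Bucket = std::array<uint8_t, 4>;
using Filter = std::vector<Bucket>;

// Append a new, larger sub-filter to the tail of the chain. Existing
// sub-filters are left untouched ("frozen"), so no data is rehashed.
void Expand(std::vector<Filter> &filters, uint32_t expansion) {
  uint64_t next_buckets = filters.back().size() * expansion;
  filters.emplace_back(next_buckets, Bucket{});  // all slots start empty
  // In RocksDB terms this would mean bumping n_filters in the metadata and
  // writing it back atomically together with any pending bucket updates.
}
```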
4.3. Exists Operation (CF.EXISTS)
The lookup walks the sub-filter chain. For each sub-filter, compute the item's two candidate buckets and check them for its fingerprint; on a match, return True immediately and terminate the operation.
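A sketch of this chained lookup, reusing the types and AltIndex helper from the section 2 sketch:

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Bucket = std::array<uint8_t, 4>;
using Filter = std::vector<Bucket>;

uint64_t AltIndex(uint64_t index, uint8_t fp, uint64_t n_buckets);  // section 2

bool Contains(const Bucket &b, uint8_t fp) {
  for (uint8_t slot : b) {
    if (slot == fp) return true;
  }
  return false;
}

// CF.EXISTS: walk the chain, stop at the first sub-filter holding the
// fingerprint. Cost grows linearly with n_filters (see section 5).
bool Exists(const std::vector<Filter> &filters, uint8_t fp, uint64_t hash) {
  for (const Filter &f : filters) {
    uint64_t i1 = hash & (f.size() - 1);
    if (Contains(f[i1], fp) || Contains(f[AltIndex(i1, fp, f.size())], fp)) return true;
  }
  return false;
}
```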
4.4. Count Operation (CF.COUNT)
This operation is the primary reason for choosing the chained expansion strategy, as it correctly handles high-frequency duplicate items.
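Unlike EXISTS, COUNT has no early exit; a sketch against the same model:

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Bucket = std::array<uint8_t, 4>;
using Filter = std::vector<Bucket>;

uint64_t AltIndex(uint64_t index, uint8_t fp, uint64_t n_buckets);  // section 2

// CF.COUNT: every sub-filter and both candidate buckets must be scanned,
// since duplicates of one item may be spread across the whole chain.
uint64_t Count(const std::vector<Filter> &filters, uint8_t fp, uint64_t hash) {
  uint64_t total = 0;
  for (const Filter &f : filters) {
    uint64_t i1 = hash & (f.size() - 1);
    uint64_t i2 = AltIndex(i1, fp, f.size());
    for (uint8_t slot : f[i1]) total += (slot == fp);
    if (i2 != i1) {
      for (uint8_t slot : f[i2]) total += (slot == fp);
    }
  }
  return total;
}
```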
4.5. Delete Operation (CF.DELETE)
The delete operation is similar to EXISTS but only needs to find and remove the first occurrence of the item.
If a matching fingerprint is found in either candidate bucket of any sub-filter, clear that slot, write the bucket back, return True immediately, and terminate the operation.
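A sketch against the same model; note that only one copy is removed per call:

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Bucket = std::array<uint8_t, 4>;
using Filter = std::vector<Bucket>;

uint64_t AltIndex(uint64_t index, uint8_t fp, uint64_t n_buckets);  // section 2

// CF.DELETE: remove a single copy of the fingerprint, so repeated inserts
// need repeated deletes and CF.COUNT stays consistent.
bool Delete(std::vector<Filter> &filters, uint8_t fp, uint64_t hash) {
  for (Filter &f : filters) {
    uint64_t i1 = hash & (f.size() - 1);
    for (uint64_t idx : {i1, AltIndex(i1, fp, f.size())}) {
      for (uint8_t &slot : f[idx]) {
        if (slot == fp) {
          slot = 0;  // 0 marks an empty slot
          return true;
        }
      }
    }
  }
  return false;  // nothing to delete
}
```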
5. Disadvantages and Trade-offs
This design is a pragmatic compromise, and it's important to understand its inherent limitations.
Query Performance Degradation: The most significant trade-off is that the performance of read operations (EXISTS, COUNT, DELETE) degrades linearly as the number of sub-filters (n_filters) increases. Each expansion adds another set of lookups to every query.
Low Utilization of Older Sub-Filters: The "write-to-tail" strategy means that once a sub-filter triggers an expansion, it becomes "frozen." It will never receive new insertions, even if space is freed up by DELETE operations. This can lead to poor overall space utilization, as older sub-filters may become sparse over time while new ones are created.
Hotspot-Driven Expansion: As identified, frequent insertion of a single hot item can rapidly trigger multiple expansions. This creates numerous sub-filters that are largely dedicated to holding this single item, further exacerbating the space utilization problem. To mitigate this, it is highly recommended to use a larger bucket_size (e.g., 4 or more) instead of the default of 2 if this usage pattern is anticipated. A larger bucket size increases the capacity for duplicate items within a single sub-filter, delaying the need for expansion: with bucket_size = 4, a single item can occupy up to 2 × 4 = 8 slots in one sub-filter before forcing an expansion, versus 4 slots with the default.
6. Concurrency Safety
Atomic Writes: All write operations must be atomic to ensure data consistency. This is achieved by using RocksDB's WriteBatch.
Locking: To prevent race conditions during a read-modify-write cycle, a lock should be acquired at the application level for the user key. This is especially critical during expansion to ensure that concurrent operations do not interfere with each other.
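A sketch of how these two points combine into a guarded read-modify-write cycle; the per-key lock table is purely illustrative (kvrocks would use its own locking utilities), while WriteBatch and DB::Write are the standard RocksDB APIs:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Illustrative per-user-key lock table; the real code would use the
// project's existing lock manager.
std::mutex &LockFor(const std::string &user_key) {
  static std::unordered_map<std::string, std::mutex> locks;
  static std::mutex table_mutex;
  std::lock_guard<std::mutex> guard(table_mutex);
  return locks[user_key];
}

rocksdb::Status AtomicUpdate(rocksdb::DB *db, const std::string &user_key,
                             const std::string &bucket_key,
                             const std::string &new_bucket,
                             const std::string &meta_key,
                             const std::string &new_meta) {
  std::lock_guard<std::mutex> guard(LockFor(user_key));
  // All mutations for one command go into a single batch, so either the
  // bucket update and the metadata update both land, or neither does.
  rocksdb::WriteBatch batch;
  batch.Put(bucket_key, new_bucket);
  batch.Put(meta_key, new_meta);
  return db->Write(rocksdb::WriteOptions(), &batch);
}
```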