@RaduBerinde
Contributor

Improve binary fuse parameter testing

We add a test that shows the range of sizes and segment counts for
each segment length.
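
To illustrate what "the range of sizes for each segment length" means, here is a minimal sketch that scans sizes and reports where the segment length changes. It assumes the 3-wise segment length has the form `2^floor(ln(size)/ln(3.33) + 2.25)`; the function name `assumedSegmentLength` is hypothetical and the constants are an assumption that should be checked against the library source before relying on them.

```go
package main

import (
	"fmt"
	"math"
)

// assumedSegmentLength is a hypothetical stand-in for the library's segment
// length computation. The formula and its constants are assumptions.
func assumedSegmentLength(size uint32) uint32 {
	if size == 0 {
		return 4
	}
	exp := math.Floor(math.Log(float64(size))/math.Log(3.33) + 2.25)
	return uint32(1) << uint(exp)
}

func main() {
	// Report every size at which the assumed segment length changes.
	prev := assumedSegmentLength(2)
	for size := uint32(3); size <= 4000; size++ {
		cur := assumedSegmentLength(size)
		if cur != prev {
			fmt.Printf("segment length %d -> %d between size %d and %d\n",
				prev, cur, size-1, size)
			prev = cur
		}
	}
}
```

Under this assumption, sizes such as 27 and 91 are the last sizes before a transition, which is consistent with them appearing in the boundary test output.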

We also add a test that checks filter generation at "boundary" sizes
in terms of segment lengths. The test prints the average and max
number of iterations for each tested size. Output with numTrials=100:

size: 2  iterations: 1.02 avg (2 max)
size: 8  iterations: 1.02 avg (2 max)
size: 24  iterations: 1.13 avg (3 max)
size: 27  iterations: 1.02 avg (2 max)
size: 55  iterations: 1.02 avg (2 max)
size: 91  iterations: 1.04 avg (3 max)
size: 120  iterations: 1.00 avg (1 max)
size: 303  iterations: 1.09 avg (3 max)
size: 349  iterations: 1.04 avg (2 max)
size: 1009  iterations: 1.02 avg (2 max)
size: 1124  iterations: 1.13 avg (2 max)
size: 3361  iterations: 1.03 avg (3 max)
size: 3551  iterations: 9.45 avg (42 max)
size: 11192  iterations: 1.03 avg (2 max)
size: 11521  iterations: 109.79 avg (528 max)
size: 37272  iterations: 1.00 avg (1 max)
size: 37454  iterations: 15.42 avg (70 max)
size: 124117  iterations: 1.02 avg (2 max)
size: 126131  iterations: 1.70 avg (6 max)
size: 413309  iterations: 1.01 avg (2 max)
size: 416077  iterations: 1.83 avg (6 max)
size: 1376321  iterations: 1.00 avg (1 max)
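
Each line above summarizes the number of build iterations over `numTrials` builds at one size. A minimal sketch of that summary step follows; `summarize` is a hypothetical helper, and the iteration counts in `main` are made up for illustration rather than taken from the test.

```go
package main

import "fmt"

// summarize returns the mean and the maximum of the per-trial iteration
// counts. It returns zeros for an empty input.
func summarize(iterations []int) (avg float64, maxIter int) {
	if len(iterations) == 0 {
		return 0, 0
	}
	sum := 0
	for _, it := range iterations {
		sum += it
		if it > maxIter {
			maxIter = it
		}
	}
	return float64(sum) / float64(len(iterations)), maxIter
}

func main() {
	// Made-up iteration counts for four trials; a real test would record
	// how many attempts each filter build needed.
	trials := []int{1, 1, 2, 1}
	avg, maxIter := summarize(trials)
	fmt.Printf("size: %d  iterations: %.2f avg (%d max)\n", 24, avg, maxIter)
}
```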

Try smaller segment length in binary fuse build

Some sizes around segment length transitions require many iterations
and would work much better with the previous segment length.

We add a simple fix that is more robust than tweaking the formula:
once every four iterations, we try the previous segment length while
keeping the same capacity. Note that in most cases this won't affect
the build because it's rare to need more than 1-2 iterations.
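
The retry policy described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: the `layout` type and its methods are hypothetical, the capacity is assumed to be `(segmentCount + 2) * segmentLength` (a 3-wise layout), the "previous" segment length is assumed to be half the current one, and which attempt in each group of four uses the fallback is a guess.

```go
package main

import "fmt"

// layout holds the two parameters that vary in this sketch (hypothetical type).
type layout struct {
	segmentLength uint32
	segmentCount  uint32
}

// capacity assumes a 3-wise layout with two extra segments.
func (l layout) capacity() uint32 {
	return (l.segmentCount + 2) * l.segmentLength
}

// withPreviousSegmentLength halves the segment length and adjusts the
// segment count so that capacity() is unchanged:
// (2c + 2 + 2) * (L / 2) == (c + 2) * L.
func (l layout) withPreviousSegmentLength() layout {
	return layout{
		segmentLength: l.segmentLength / 2,
		segmentCount:  2*l.segmentCount + 2,
	}
}

// layoutForAttempt returns the layout to use on a 1-based build attempt.
// Every fourth attempt tries the smaller segment length (assumed schedule),
// unless the segment length is already at an assumed minimum of 4.
func layoutForAttempt(attempt int, base layout) layout {
	if attempt%4 == 0 && base.segmentLength > 4 {
		return base.withPreviousSegmentLength()
	}
	return base
}

func main() {
	base := layout{segmentLength: 1024, segmentCount: 10}
	for attempt := 1; attempt <= 8; attempt++ {
		l := layoutForAttempt(attempt, base)
		fmt.Printf("attempt %d: segment length %d, segment count %d, capacity %d\n",
			attempt, l.segmentLength, l.segmentCount, l.capacity())
	}
}
```

Because the capacity stays fixed, the fallback changes only how the slots are partitioned into segments, not how much memory the filter uses.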

TestBinaryFuseBoundarySizes output (with numTrials=100):

binaryfusefilter_test.go:490: size: 2  iterations: 1.02 avg (2 max)
binaryfusefilter_test.go:490: size: 8  iterations: 1.08 avg (3 max)
binaryfusefilter_test.go:490: size: 24  iterations: 1.08 avg (3 max)
binaryfusefilter_test.go:490: size: 27  iterations: 1.03 avg (2 max)
binaryfusefilter_test.go:490: size: 55  iterations: 1.02 avg (2 max)
binaryfusefilter_test.go:490: size: 91  iterations: 1.02 avg (2 max)
binaryfusefilter_test.go:490: size: 120  iterations: 1.04 avg (2 max)
binaryfusefilter_test.go:490: size: 303  iterations: 1.04 avg (2 max)
binaryfusefilter_test.go:490: size: 349  iterations: 1.01 avg (2 max)
binaryfusefilter_test.go:490: size: 1009  iterations: 1.01 avg (2 max)
binaryfusefilter_test.go:490: size: 1124  iterations: 1.16 avg (4 max)
binaryfusefilter_test.go:490: size: 3361  iterations: 1.03 avg (2 max)
binaryfusefilter_test.go:490: size: 3551  iterations: 2.05 avg (6 max)
binaryfusefilter_test.go:490: size: 11192  iterations: 1.04 avg (3 max)
binaryfusefilter_test.go:490: size: 11521  iterations: 2.10 avg (6 max)
binaryfusefilter_test.go:490: size: 37272  iterations: 1.01 avg (2 max)
binaryfusefilter_test.go:490: size: 37454  iterations: 2.09 avg (6 max)
binaryfusefilter_test.go:490: size: 124117  iterations: 1.03 avg (2 max)
binaryfusefilter_test.go:490: size: 126131  iterations: 1.53 avg (4 max)
binaryfusefilter_test.go:490: size: 413309  iterations: 1.00 avg (1 max)
binaryfusefilter_test.go:490: size: 416077  iterations: 1.50 avg (4 max)
binaryfusefilter_test.go:490: size: 1376321  iterations: 1.02 avg (3 max)

Informs #23, #24

@lemire
Member

lemire commented Jan 7, 2026

One point to take into consideration is that if your application is to create a set of 100 or 1000 elements, these probabilistic filters are probably not worth the effort. I am not opposed to changing the formula but we should be clear on the objectives.

Ping @thomasmueller

@RaduBerinde
Contributor Author

RaduBerinde commented Jan 8, 2026

> One point to take into consideration is that if your application is to create a set of 100 or 1000 elements, these probabilistic filters are probably not worth the effort. I am not opposed to changing the formula but we should be clear on the objectives.

I agree. I am experimenting with binary fuse filters for Pebble (https://github.com/cockroachdb/pebble), where in a typical LSM we would see ~40K keys in the upper LSM level files and ~600K keys in the lowest level files. But these numbers could be very different for various workloads and I don't want to have a separate code path if we happen to have a smaller set of keys.

@lemire lemire merged commit 9e0c9da into FastFilter:master Jan 8, 2026
5 checks passed
@lemire
Member

lemire commented Jan 8, 2026

Merged. I will release.

@lemire
Member

lemire commented Jan 8, 2026

@RaduBerinde You may want to check this link https://gihub.com/cockroachdb/pebble

It is likely not pointing at what you expect.
