Some quick/basic experiments with bloom filters
For an overview of bloom filters see this article: "Space-efficient and exact de Bruijn graph representation based on a Bloom filter" http://almob.biomedcentral.com/articles/10.1186/1748-7188-8-22 or https://en.wikipedia.org/wiki/Bloom_filter
The goal of this project is to
- test performance of different block sizes for the bit array (
uint8_t
..uint64_t
) - test performance of
vector<bool>
- experiment with different hash functions
Interestingly I found that the probability of false positives is higher (up to 9x!) than expected if I use std::hash vs. SpookyHash from http://burtleburtle.net/bob/hash/spooky.html
Determined m = 14377587, h = 10 for desired p(false +ve) 0.001
std::hash False positive rate **0.067031** versus expected 0.001000
SpookyHash False positive rate 0.000949 versus expected 0.001000
Determined m = 9585058, h = 7 for desired p(false +ve) 0.010000
std::hash False positive rate **0.098459** versus expected 0.010039
SpookyHash False positive rate 0.010074 versus expected 0.010039
Determined m = 4792529, h = 4 for desired p(false +ve) 0.100000
std::hash False positive rate **0.188301** versus expected 0.102603
SpookyHash False positive rate 0.102354 versus expected 0.102603
The weakness of std::hash may be exacerbated by the fact my data is generated by LCG...
Anyway as far as performance goes, using the better hash dominates the CPU time
vector<bool>:
Filling took 1.452523s (1.452523e-07 x 10000000 its)
Testing took 0.983000s (9.830005e-08 x 10000000 its)
vector<bool>, spooky hash:
Filling took 6.749714s (6.749714e-07 x 10000000 its)
Testing took 6.254350s (6.254350e-07 x 10000000 its)
So just looking at the cheaper hash because it will highlight the difference in performance of storage type:
Determined m = 1917011, h = 7 for desired p(false +ve) 0.01
64 bit blocks, std::hash... : filling 0.774081s testing 0.522455s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 1 byte mis-aligned: filling 0.884380s testing 0.437734s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 2 byte mis-aligned: filling 0.691262s testing 0.462049s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 3 byte mis-aligned: filling 0.650693s testing 0.509199s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 4 byte mis-aligned: filling 0.652699s testing 0.547534s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 5 byte mis-aligned: filling 0.690840s testing 0.520012s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 6 byte mis-aligned: filling 0.728454s testing 0.529477s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 7 byte mis-aligned: filling 0.640647s testing 0.492668s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 8 byte mis-aligned: filling 0.667098s testing 0.483496s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
vector<bool>, std::hash : filling 0.671950s testing 0.494453s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks, std::hash... : filling 0.709809s testing 0.535923s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks - 1 byte mis-aligned: filling 0.659153s testing 0.498099s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks - 2 byte mis-aligned: filling 0.686998s testing 0.492405s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks - 3 byte mis-aligned: filling 0.689679s testing 0.487244s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
8 bit blocks : filling 0.661951s testing 0.517451s p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
So... basically block size and byte alignment makes little difference..