Skip to content

mikenz1000/bloomfilter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

bloomfilter

Some quick/basic experiments with bloom filters

For an overview of bloom filters see this article: "Space-efficient and exact de Bruijn graph representation based on a Bloom filter" http://almob.biomedcentral.com/articles/10.1186/1748-7188-8-22 or https://en.wikipedia.org/wiki/Bloom_filter

The goal of this project is to

  • test performance of different block sizes for the bit array (uint8_t .. uint64_t)
  • test performance of vector<bool>
  • experiment with different hash functions

Interestingly I found that the probability of false positives is higher (up to 9x!) than expected if I use std::hash vs. SpookyHash from http://burtleburtle.net/bob/hash/spooky.html

Determined m = 14377587, h = 10 for desired p(false +ve) 0.001
std::hash	 False positive rate **0.067031** versus expected 0.001000
SpookyHash False positive rate 0.000949 versus expected 0.001000

Determined m = 9585058, h = 7 for desired p(false +ve) 0.010000
std::hash  False positive rate **0.098459** versus expected 0.010039
SpookyHash False positive rate 0.010074 versus expected 0.010039

Determined m = 4792529, h = 4 for desired p(false +ve) 0.100000
std::hash  False positive rate **0.188301** versus expected 0.102603
SpookyHash False positive rate 0.102354 versus expected 0.102603

The weakness of std::hash may be exacerbated by the fact my data is generated by LCG...

Anyway as far as performance goes, using the better hash dominates the CPU time

vector<bool>:
	Filling took 1.452523s (1.452523e-07 x 10000000 its)
	Testing took 0.983000s (9.830005e-08 x 10000000 its)
vector<bool>, spooky hash:
	Filling took 6.749714s (6.749714e-07 x 10000000 its)
	Testing took 6.254350s (6.254350e-07 x 10000000 its)

So just looking at the cheaper hash because it will highlight the difference in performance of storage type:

Determined m = 1917011, h = 7 for desired p(false +ve) 0.01
64 bit blocks, std::hash...       : 	filling 0.774081s	testing 0.522455s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 1 byte mis-aligned: 	filling 0.884380s	testing 0.437734s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 2 byte mis-aligned: 	filling 0.691262s	testing 0.462049s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 3 byte mis-aligned: 	filling 0.650693s	testing 0.509199s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 4 byte mis-aligned: 	filling 0.652699s	testing 0.547534s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 5 byte mis-aligned: 	filling 0.690840s	testing 0.520012s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 6 byte mis-aligned: 	filling 0.728454s	testing 0.529477s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 7 byte mis-aligned: 	filling 0.640647s	testing 0.492668s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
64 bit blocks - 8 byte mis-aligned: 	filling 0.667098s	testing 0.483496s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
vector<bool>, std::hash           : 	filling 0.671950s	testing 0.494453s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks, std::hash...       : 	filling 0.709809s	testing 0.535923s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks - 1 byte mis-aligned: 	filling 0.659153s	testing 0.498099s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks - 2 byte mis-aligned: 	filling 0.686998s	testing 0.492405s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
32 bit blocks - 3 byte mis-aligned: 	filling 0.689679s	testing 0.487244s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039
8 bit blocks                      : 	filling 0.661951s	testing 0.517451s	p(false +ve) 0.100190 vs E[p(false +ve)] 0.010039

So... basically block size and byte alignment makes little difference..

About

Some basic experiments with bloom filters in C++

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published