
Second Hash Function? #78

Open
wiany11 opened this issue Dec 11, 2019 · 3 comments

Comments

@wiany11

wiany11 commented Dec 11, 2019

Does it only count collisions, or does it handle them as well?
If I want to build my own hash table with MurmurHash, how should I handle collisions...?

Thank you,

@sebres

sebres commented Dec 11, 2019

If I want to build my own hash table ... how should I handle collisions...?

You should put entries with the same hash into a list, no matter which hash function is used, unless it is a perfect hash function (a PHF, which is basically only possible for a fixed set).
So, for example, if you have 5 different entries producing 2 collisions using H:

H(AAA) = 1
H(BBB) = 1
H(CCC) = 2
H(DDD) = 2
H(EEE) = 3

the hash table, represented as flat JSON, could look like this:

{
  "1": ["AAA", "BBB"],
  "2": ["CCC", "DDD"],
  "3": ["EEE"]
}

Because the probability of a hash collision is very small (depending on the data set), you'd have few elements in each list (mostly just 1).
In any case, if your hash function is not deterministically unique, i.e. not an ideal PHF, you must always compare the stored entry key with the needle after a hash match (in the case of a list, for every element of the list corresponding to that hash).
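As a minimal sketch of the chaining scheme above: each bucket is a list, and a full key comparison always follows a hash match. Since MurmurHash is not in the Python standard library, a simple FNV-1a hash is used here as a stand-in; the class and function names are illustrative, not from any discussed library.

```python
# Sketch of a chained hash table; fnv1a_32 is a stand-in for MurmurHash.

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

class ChainedTable:
    def __init__(self, nbuckets=1024):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key: str):
        return self.buckets[fnv1a_32(key.encode()) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # same key: overwrite in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (or collision): chain it

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:                 # hash match alone is not enough
                return v
        return default

t = ChainedTable()
t.put("AAA", 1); t.put("BBB", 2)
print(t.get("AAA"), t.get("BBB"), t.get("ZZZ"))  # -> 1 2 None
```

Even with a table of a single bucket (every insert colliding), the key comparison keeps lookups correct; the chains just grow longer.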

A second hash function cannot avoid collisions on a variable data set; it would just decrease their probability, depending on the distribution of both functions over that set (and the probability is larger than you might assume, see the Birthday Paradox).

@wiany11
Author

wiany11 commented Dec 13, 2019

@sebres But linear probing is faster and easier to implement than chaining... I don't know how to find the best (or a proper) 2nd hash function, or how to probe to handle collisions...

@sebres

sebres commented Dec 13, 2019

The point is: if the set you want to put into the table is variable, you cannot really avoid collisions without verifying the whole set, or else without a mathematical proof for the combination of both hash functions.
The fact is: even with two functions you compress M bits of original data into N bits of hash, and if M > N (otherwise hashing doesn't make sense) you lose some information during this "compression". So, for example, even in the "ideal" case of two 64-bit functions (hashing into a combined 128-bit value, each hashing a different subset of the bits), the birthday problem gives you 2^(128/2), i.e. 2^64, as a lower bound before you catch a collision.
But this is the generalized problem and assumes random inputs, which I assume your data set is not, so you can expect worse results.
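The 2^(N/2) rule of thumb above can be checked numerically: for an ideal N-bit hash, the expected number of random inputs before the first collision is about sqrt(pi/2 * 2^N), which sits just above 2^(N/2). A quick sketch:

```python
import math

def birthday_bound(bits: int) -> float:
    # Expected number of random inputs before the first collision
    # for an ideal `bits`-wide hash: ~ sqrt(pi/2 * 2^bits).
    return math.sqrt(math.pi / 2 * 2 ** bits)

for bits in (32, 64, 128):
    print(f"{bits:3d}-bit hash: first collision expected near "
          f"{birthday_bound(bits):.2e} inputs (2^{bits // 2} = {2 ** (bits // 2):.2e})")
```

For a 32-bit hash the bound is only around 2^16 inputs, which is why real-world collision hunts like the ones mentioned below succeed so quickly.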

Worse still, the combination of both functions would probably be far from ideal in the sense of common hashing.
Just as an example, see the two strings at the end of this test. They differ by only a few bits (the characters enclosed in brackets), but collide with the same MD5 hash (so the 128-bit result "ignores" those bits in such strings, depending on the other bits). Now imagine your second function also "loses" the effect of those same bits in its hash, or that the impact of "compressing" these strings is lost in its result in a similar way. The resulting hashes of the two functions would probably differ, MD5(a) != 2ndH(a) and MD5(b) != 2ndH(b), but because MD5(a) == MD5(b) and 2ndH(a) == 2ndH(b), combining both into a single hash doesn't really improve the collision properties on a subset containing strings like these.

Also note that the "quality" of the set is sometimes more important than its quantity: if you run a trial hashing binary vs. text (ASCII) data sets, you'll get totally different results (you'd probably catch a collision earlier hashing the text blocks). Here are some tests of one 32-bit hash function; as you can see, the results are sometimes far from 2^(32/2) (2^16), just because of the data-set format.

See the similar question #58 about collision properties (note that constructing a mathematical proof for two functions is more complex).

Or, how to probe for handling collisions...

You can implement your hash table the way you wanted to (without serialization inside the bucket, i.e. without grouping the entries into a list), but with a key comparison in the CreateHashEntry function whenever the hash value already exists: if the two keys differ, you have found a collision.
Then just store a few million different entries with a key format similar to what you want to hash (e.g. the file paths of some directory, recursively) in such a hash table, and I guess you'll catch a collision.
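That collision-hunting experiment can be sketched in a few lines: hash many similarly formatted keys and report any pair of different keys that map to the same value. The FNV-1a function below is again a stand-in for MurmurHash, and the path-like key format is just an illustration.

```python
# Sketch of a collision hunt over a 32-bit hash (fnv1a_32 stands in
# for MurmurHash; the path-like keys are illustrative).

def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5
    for b in data:
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h

def find_collisions(keys):
    seen = {}                        # hash value -> first key seen with it
    collisions = []
    for k in keys:
        h = fnv1a_32(k.encode())
        if h not in seen:
            seen[h] = k
        elif seen[h] != k:           # same hash, different key: collision
            collisions.append((seen[h], k, h))
    return collisions

# Keys with a shared format, as suggested above for realistic tests.
keys = [f"/usr/share/doc/pkg-{i}/README" for i in range(200_000)]
print("collisions found:", len(find_collisions(keys)))
```

With a 32-bit hash and a few hundred thousand keys, the birthday bound makes at least a handful of collisions likely; with a few million, they are practically guaranteed.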
Here Stéphane found more than 400 collisions for MM3 on 2 million paths.
I also ran such tests; for example, with SHA-224 I got my first collision after ca. 7E9 URLs (ca. 10 hours, 180 bytes average URL length), whereas a collision is expected only around 5E33 (i.e. 2^112).
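For completeness, the probing question asked above can be sketched as double hashing: an open-addressing table where a second hash function determines the probe step. The two hash functions here are simple illustrative stand-ins (not MurmurHash), and the prime table size together with a non-zero step guarantees every slot is reachable.

```python
# Minimal open-addressing table with double hashing. h1 picks the start
# slot, h2 the probe step; both are illustrative stand-ins.

TABLE_SIZE = 11              # prime, so any step 1..10 visits every slot

def h1(key: str) -> int:
    return sum(key.encode()) % TABLE_SIZE

def h2(key: str) -> int:
    # Step must be non-zero (and coprime with TABLE_SIZE, which a prime
    # size guarantees), otherwise probing could loop over few slots.
    return 1 + (sum(b * 31 for b in key.encode()) % (TABLE_SIZE - 1))

class DoubleHashTable:
    def __init__(self):
        self.slots = [None] * TABLE_SIZE   # each slot: (key, value) or None

    def put(self, key, value):
        i, step = h1(key), h2(key)
        for _ in range(TABLE_SIZE):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
            i = (i + step) % TABLE_SIZE    # occupied by another key: probe on
        raise RuntimeError("table full")

    def get(self, key, default=None):
        i, step = h1(key), h2(key)
        for _ in range(TABLE_SIZE):
            entry = self.slots[i]
            if entry is None:
                return default
            if entry[0] == key:            # full key compare after hash match
                return entry[1]
            i = (i + step) % TABLE_SIZE
        return default

t = DoubleHashTable()
for i, k in enumerate(["AAA", "BBB", "CCC", "DDD", "EEE"]):
    t.put(k, i)
print([t.get(k) for k in ["AAA", "BBB", "CCC", "DDD", "EEE"]])  # -> [0, 1, 2, 3, 4]
```

Note that even here the full key comparison after a hash match is unavoidable, exactly as discussed above; and deleting entries from an open-addressing table needs extra care (tombstones) that this sketch omits.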
