-
Notifications
You must be signed in to change notification settings - Fork 603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ustring): ustring hash collision protection #4350
base: main
Are you sure you want to change the base?
Conversation
The gist is that the ustring::strhash(str) function is modified to strip out the MSB from Strutil::strhash. The rep entry is filed in the ustring table based on this hash. So effectively, the computed hash is 63 bits, not 64. But rep->hashed field consists of the lower 63 bits being the computed hash, and the MSB indicates whether this is the 2nd (or more) entry in the table that had the same 63 bit hash. ustring::hash() then is modified as follows: If the MSB is 0, the computed hash is the hash. If the MSB is 1, though, we DON'T use that hash, and instead we use the pointer to the unique characters, but with the MSB set (that's an invalid address by itself). Note that the computed hashes never have MSB set, and the char*+MSB always have MSB set, so therefore ustring::hash() will never have the same value for two different ustrings. But -- please note! -- that ustring::strhash(str) and ustring(str).hash() will only match (and also be the same value on every execution) if the ustring is the first to receive that hash, which should be approximately always. Probably always, in practice. But in the very improbable case of a hash collision, one of them (the second to be turned into a ustring) will be using the alternate hash based on the character address, which is both not the same as ustring::strhash(chars), nor is it expected to be the same constant on every program execution. Signed-off-by: Larry Gritz <[email protected]>
Comments @cmstein @sfriedmapixar @etheory ? |
Maybe @chellmuth too? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this locally with the MSB polarity flipped from what you've presented. I did this it's easier to see (in a debugger) pointer-based perfect hashes are pointers, it doesn't require bit manip to deref them, and so and if it happens, you can just dive into those in a debugger without having to mask off the high bit. Semantically I think it's also cleaner to actually use that pointer value as the hashing value into the map, and store it in the tablerep. Really, just think of it as a way to create a perfect hash out of an imperfect one -- with the caveat that it assumes you're on a system with user-space pointers that can never have the MSB set. You could do the same thing with the LSB instead since the tablerep always comes out of a pool, by guaranteeing the alignment of those allocations wouldn't have the low bit set for pointers and that would work everywhere. If you just think of the collision resolution as part of the "hashing" function, then it makes sense to store that hash in the tablerep for it in the "hashed" member.
static OIIO_HOSTDEVICE constexpr hash_t strhash(string_view str) | ||
{ | ||
return Strutil::strhash(str) & hash_mask; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our implementation, we do an OR instead of an and, so that "pointer hashes" look like pointers in a debugger, and it saves clearing the bit when we want to use it like a pointer, and "hash-hashes" are always negative when looking at them in a debugger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I think I was trying to preserve the "0 means empty string" property without needing extra conditionals, but maybe it's easier than I thought. I will noodle with this and see if I can make it work just as simply with the reversed polarity, because I do see your point about how it's nice for the pointers to be the unchanged values.
hash_t h = rep()->hashed; | ||
return OIIO_LIKELY((h & duplicate_bit) == 0) | ||
? h | ||
: hash_t(m_chars) | duplicate_bit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a "pointer hash", you can actually just store that "hash" the "hashed" member on TableRep creation, then you don't need to do any of this hashing or checking at all, you just return the value like before.
@@ -757,10 +776,29 @@ class OIIO_UTIL_API ustring { | |||
TableRep(string_view strref, hash_t hash); | |||
~TableRep(); | |||
const char* c_str() const noexcept { return (const char*)(this + 1); } | |||
constexpr bool unique_hash() const | |||
{ | |||
return (hashed & duplicate_bit) == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you flip the polarity of the indicator bit, this needs to flip.
@@ -71,6 +105,10 @@ template<unsigned BASE_CAPACITY, unsigned POOL_SIZE> struct TableRepMap { | |||
|
|||
const char* lookup(string_view str, uint64_t hash) | |||
{ | |||
if (OIIO_UNLIKELY(hash & ustring::duplicate_bit)) { | |||
// duplicate bit is set -- the hash is related to the chars! | |||
return OIIO::bitcast<const char*>(hash & ustring::hash_mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you flip the polarity of the indicator bit, the extra & is uneeded.
@@ -98,6 +135,10 @@ template<unsigned BASE_CAPACITY, unsigned POOL_SIZE> struct TableRepMap { | |||
// the hash. | |||
const char* lookup(uint64_t hash) | |||
{ | |||
if (OIIO_UNLIKELY(hash & ustring::duplicate_bit)) { | |||
// duplicate bit is set -- the hash is related to the chars! | |||
return OIIO::bitcast<const char*>(hash & ustring::hash_mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you flip the polarity of the indicator bit, the extra & is uneeded.
++dist; | ||
pos = (pos + dist) & mask; // quadratic probing | ||
} | ||
} | ||
|
||
const char* insert(string_view str, uint64_t hash) | ||
{ | ||
OIIO_ASSERT((hash & ustring::duplicate_bit) == 0); // can't happen? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you flip the polarity of the indicator bit, flip this assert.
// If we encountered another ustring with the same hash (if one | ||
// exists, it would have hashed to the same address so we would have | ||
// seen it), set the duplicate bit in the rep's hashed field. | ||
if (duplicate_hash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the polarity of the indicator bit, you can pass "duplicate_hash" into make_rep, and then it can just store the pointer from the pool alloc in the "hashed" member, and all you have to do here is inc the collisions counter (or the inc and dbug print could be moved into make_rep as well. -- if you do that, you need to do another pobe for a valid "pos" here based on the pointer bits rather than the original hash bits.
// ensure low bit clear | ||
OIIO_DASSERT((size_t(rep->c_str()) & 1) == 0); | ||
// ensure low bit clear | ||
OIIO_DASSERT((size_t(rep->c_str()) & ustring::duplicate_bit) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change the polarity of the indicator bit, you need to flip this assert.
hash_t hash = Strutil::strhash64(strref); | ||
// This line, if uncommented, lets you force lots of hash collisions: | ||
// hash &= ~hash_t(0xffffff); | ||
hash_t hash = ustring::strhash(strref); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a similar "force a lot of collisions" to to locally test that the collision code is all correct, now that it's much more complicated with the ptr vs hash perfect hashing.
collisions->emplace_back(ustring::from_unique(c.first)); | ||
return all_hash_collisions.size(); | ||
if (collisions) { | ||
// Disabled for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to keep a collection of all hash collisions if you put the pointer into the "hashed" member on collision...then you can just iterate through all the strings in the system, and print out those who's hash is actually a pointer.
I'm OK with any of these options as long as collisions are handled. Also remember that on every current system, the bottom few bits are zero due to alignment, and the top few are zero due to memory allocation guarantees: https://stackoverflow.com/questions/46509152/why-in-x86-64-the-virtual-address-are-4-bits-shorter-than-physical-48-bits-vs which means you can often rely on the top 12 to 16bits also being zero. Keep that in mind when you do any of these tricks. We use these bits in glimpse for storage of data, so it's quite useful. |
Something that has been nagging at my mind... I know that the MSB will always be 0 for a real address on x86, but do we feel confident that this will also be true for ARM, RISC V, or others that may come along? I also considered using the LSB for this, also safe because the allocations are always aligned, the characters will never start on an odd address, so if we " Thoughts? |
The reason I really like empty string to be all 0 bits (and thus thought that must mean that it's pointers, not hashes, with the extra bit set) is so we can memset to 0 a region and know that we've made valid (empty) ustrings or ustringhashes. |
Setting the LSB hash will reduce the hash quality in a risky way. Suppose for instance someone did: |
Right, gotcha. That's a good reason to stick with MSB. It's ok, we'll have asserts that no real address has the high bit set. We'll have faith that we'll catch it right away if anybody ports to a hardware platform that might use that part of the address space. |
OK, in my tree (not yet pushed here) I recoded it all to make the MSB 1 for "original" hashes, and 0 for rehashes (which are replaced by the char ptr as the return value for In the table (invisible to the user), each entry has a hashed field, and the chars themselves. The MSB of the hashed field reveals if this is the first entry with that hash (used to be 0, now is 1 in my new draft), or if it is a subsequently encountered string that had the same 31 bit hash. The public hash() function returns rep->hashed if it's a first string entry having that hash, or the char address if it's a duplicate hash. A property we've always had that I think is very important to retain is that both ustring and ustringhash are represented by a true 0 value for empty strings. In other words, it's special -- the ptr is null and the hash should be 0. This is what lets us zero out a block of memory and know that it's initializing any ustring or ustringhash therein to be valid, and represent an empty string. Now then, the way I did it first in this PR trivially preserved that property, because MSB was 1 only for duplicate hashes, so the empty string is canonical and hashes to 0. The alternate way, where it would "want" to be a MSB 1 for canonical hashes, requires special handling of the empty string case in a variety of places. I think this is a little confusing and error-prone, and certainly has a runtime cost for the extra testing and branching. @ssteinbach and others pointed out that it's nice for the pointers to have the MSB=0 rather than =1, because then you can just cast it to a pointer to get to the characters, especially helpful when debugging. But here's where I want to push back a little: Is it, really? The only case where you'd get a hash that's a pointer is for a 2nd string that had the same hash. This should be exceptionally rare, so it's not something that would come up... almost never... while in the debugger looking at a ustringhash. Usually you'd just see the hash, so there's not a direct way to get to the chars, so it doesn't matter if MSB is 0 or 1 for that case. (And if we set MSB on the pointer, that just means to get to the chars, you'd need to mask it... awkward, but how often does that really come up? You could also just CALL -- yes, in the debugger -- ustringhash.c_str() to get to the chars.) So here's the crux of it: MSB=1 for duplicate hash:
MSB=0 for duplicate hash:
Thoughts? Are people willing to pay (slightly) in runtime perf to have this property of trivial cast of duplicate hashes to pointers in the debuger, that you'll almost never encounter in real life? |
You know what, add a static function to do the cast in the debugger and you solve all the issues. Should be straight forward enough. Then you can just call the correct function in the debugger directly. Just document it clearly and that should resolve the issue no? |
Isn't that just the existing |
If you only have the pointer for some reason, you cannot call the function on it (though I guess you could cast it....). |
If you have the pointer, you already have the pointer. If you have a hash value If you have a hash value |
Does my last answer assuage concerns, or do we need an additional utility function? |
Yep all good thanks @lgritz - sorry for the slow turn around, I was on leave. |
@sfriedmapixar you had asked about the stab I took at swapping the bit to the other way as you originally suggested. If you're curious, it's here: https://github.com/lgritz/OpenImageIO/tree/lg-ustringhash2 (just one additional commit on top of what I have here in this PR) I like this PR here better, as it avoids some extra logic (both potentially perf impacting, but maybe more importantly, error prone and easy to forget) that is necessary to special case empty string handing. Let me know if you feel really strongly about doing it the other way. If you're ok with this, though, I'd like to merge it. |
The gist is that the ustring::strhash(str) function is modified to strip out the MSB from Strutil::strhash. The rep entry is filed in the ustring table based on this hash. So effectively, the computed hash is 63 bits, not 64.
But rep->hashed field consists of the lower 63 bits being the computed hash, and the MSB indicates whether this is the 2nd (or more) entry in the table that had the same 63 bit hash.
ustring::hash() then is modified as follows: If the MSB is 0, the computed hash is the hash. If the MSB is 1, though, we DON'T use that hash, and instead we use the pointer to the unique characters, but with the MSB set (that's an invalid address by itself). Note that the computed hashes never have MSB set, and the char*+MSB always have MSB set, so therefore ustring::hash() will never have the same value for two different ustrings.
But -- please note! -- that ustring::strhash(str) and ustring(str).hash() will only match (and also be the same value on every execution) if the ustring is the first to receive that hash, which should be approximately always. Probably always, in practice.
But in the very improbable case of a hash collision, one of them (the second to be turned into a ustring) will be using the alternate hash based on the character address, which is both not the same as ustring::strhash(chars), nor is it expected to be the same constant on every program execution.