Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support getting CityHash64 values for raw byte inputs #69

Open
nealmcb opened this issue May 2, 2022 · 2 comments
Open

Support getting CityHash64 values for raw byte inputs #69

nealmcb opened this issue May 2, 2022 · 2 comments
Assignees

Comments

@nealmcb
Copy link

nealmcb commented May 2, 2022

I'm getting different hash value outputs from python-cityhash than from other implementations. E.g.
fasthash::city::Hash64 - Rust

assert_eq!(Hash64::hash(b"hello"), 2578220239953316063);

vs

>>> cityhash.CityHash64WithSeeds(b"BU9[85WWp/ HASH!", 2239407493875278174, 1318041201923111131)
5218636442198533358

I assume that is because the Python implementation hashes the entire String data structure, not just the 5 bytes of "hello".
That of course makes sense for the goal of a fast hash algorithm for arbitrary data structures.

But it would be nice to also have a way to hash arbitrary byte arrays. Perhaps some sort of argument "raw=True" could be added to the functions?

As an aside, it is frustratingly hard to find any official examples of hash values or test vectors from CityHash official sources!
They are absent from the presentation at CityHash: Fast Hash Functions for Strings. The city-test.cc code is inscrutable. That Rust example is the best I've found. A sad state of affairs.

So I'll add one more, and at the same time more clearly demonstrate the collision issue that was revealed by djb, Jean-Philippe Aumasson, and Martin Boßlet at 29C3: Hash-flooding DoS reloaded: attacks and defenses. I modified their poc citycollisions-20120730.tar.gz code (gone from their original web site, but preserved by the amazing archive.org!) to be clearer about how to call the code to reproduce their collisions.

$ ./citycollisions_ascii 30000
128-bit hex key 24fbab96507d3be76326ad973ed6d702
== 2 64-bit base-10 keys k1, k2: 2664912266603609063, 7144588723975804674
CityHash64WithSeeds( 'BU9[85WWp/ HASH!', 16, k1, k2 ) = a2f4696e95a3dfbc
CityHash64WithSeeds( '8{YDLn;d.2 HASH!', 16, k1, k2 ) = a2f4696e95a3dfbc
CityHash64WithSeeds( 'd+nkK&t?yr HASH!', 16, k1, k2 ) = a2f4696e95a3dfbc
CityHash64WithSeeds( '{A.#v5i]V{ HASH!', 16, k1, k2 ) = a2f4696e95a3dfbc
@escherba
Copy link
Owner

escherba commented May 2, 2022

[deleted previous paragraph because I think it was incorrect]

It's been a while since I worked on this but

  >>> cityhash.CityHash64("hello")
  13009744463427800296
  >>> cityhash.CityHash64(b"hello")
  13009744463427800296

Yeah there may be some translation here that is preventing the above results from being different. I will look into this a bit later.

CityHash64 and FarmHash64 provide two types of functions:

  • fingerprints - these don't take seed and are supposed to produce identical answers across different machines
  • seed functions - these take a seed and their results may or may not be stable depending on your architecture.

If you are looking for a hash that's stable across implementations, you probably want to use a fingerprint (in FarmHash, the function is literally called Fingerprint, while in CityHash, it is the function that does not take a seed). So you should check whether the Rust implementation is a fingerprint or not and use the seedless function when trying to match it. That said, the fingerprint function (CityHash64) still does not match the Rust method above, and I don't know what could be causing the discrepancy.

Regarding matching the official spec, rather than studying test_city.cc, it may be easier to modify tests/test_cityhash.cc which contains some C++ tests I wrote. You can compile and run these with make cpp-test. [Update -- I did just that and I cannot match Rust result by calling from within C++.]

@nealmcb
Copy link
Author

nealmcb commented May 10, 2022

Ahh - thanks for pointing out at the byte and Unicode objects have the same hash. Using the technique at How to memory dump an object?
it is indeed clear that b"hello" is different from "hello" at the Python object representation level, so that helps narrow down the issue.

import sys
import ctypes
s1 = "hello"
len(s1)
5
sys.getsizeof(s1)
54
s1ptr = ctypes.cast(id(s1), ctypes.POINTER(ctypes.c_uint8))
bytes([s1ptr[i] for i in range(sys.getsizeof(s1))])
b'\x05\x00\x00\x00\x00\x00\x00\x00`\xb8\x90\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00n\x04k\x81\xbc\xbf\xd8\x85\xe5\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00hello\x00'
hash(s1)
-8802074655349603218

s1b = b"hello"
sys.getsizeof(s1b)
38
s1bptr = ctypes.cast(id(s1b), ctypes.POINTER(ctypes.c_uint8))
bytes([s1bptr[i] for i in range(sys.getsizeof(s1b))])
b'\x01\x00\x00\x00\x00\x00\x00\x00`\xb2\x90\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00n\x04k\x81\xbc\xbf\xd8\x85hello\x00'

I'm interested in both the fingerprints and the seed functions. I had also tried the fingerprints and they didn't match for me either.

@escherba escherba self-assigned this May 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants