Skip to content

gh-136599: Improve long_hash #136600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

Conversation

eendebakpt
Copy link
Contributor

@eendebakpt eendebakpt commented Jul 12, 2025

  • There is a corner case in the C code that not tested in long_hash: the final check
    if (x == (Py_uhash_t)-1)
        x = (Py_uhash_t)-2;
  • We can improve performance of long_hash by unrolling the first two digits

For reviewers: the first commit adds a test, the second and third commit unroll the hash calculation for the digits. The performance gain is not a lot, so I would be fine with reverting the last two commits. A test scripts:

import pyperf

setup = """

z = 2 << 30
two_digit_ints = list(range(z, z + 2000))

def long_hash_small_int():
    for _ in range(4):
        for ii in range(0,250):
            hash(ii)

def long_hash_one_digit():
    z = 1000
    for ii in range(z, z + 1000):
        hash(ii)

def long_hash_two_digit():
    z = 2 << 30
    for ii in range(z, z + 1000):
        hash(ii)

def long_hash_multi_digit():
    z = 1 << 30 << 30 << 30 << 30
    for ii in range(z, z + 1000):
        hash(ii)
"""

runner = pyperf.Runner()
runner.timeit(name="long_hash_small_int", stmt="long_hash_small_int()", setup=setup)
runner.timeit(name="long_hash_one_digit", stmt="long_hash_one_digit()", setup=setup)
runner.timeit(name="long_hash_two_digit", stmt="long_hash_two_digit()", setup=setup)
runner.timeit(name="long_hash_multi_digit", stmt="long_hash_multi_digit()", setup=setup)
runner.timeit(name="set(ints)", stmt="set(two_digit_ints)", setup=setup)

On my system I get a fairly consistent performance increase of 10% for the set(ints) test. For the other benchmarks my system (thermal throttled laptop) is not stable enough (the benchmarks are also a bit dominated by the creation of new integers).

Comment on lines 1698 to 1700
assert hash(10) == 10
assert hash(-1) == -2
assert hash(-2**61) != -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert hash(10) == 10
assert hash(-1) == -2
assert hash(-2**61) != -1
assertEqual(hash(10), 10)
assertEqual(hash(-1), -2)
assertNotEqual(hash(-2**61), -1)

Such methods are generally preferred (e.g. the above test) as they provide better errors.

--i;
x = ((x << PyLong_SHIFT));
x += v->long_value.ob_digit[i];
assert(x < _PyHASH_MODULUS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert(x < _PyHASH_MODULUS);
if (x >= _PyHASH_MODULUS)
x -= _PyHASH_MODULUS;

This assert will trigger for values like hash(2**31) for 32bit targets which use the default #define PYLONG_BITS_IN_DIGIT 30

cpython/Include/pyport.h

Lines 132 to 138 in 0d4fd10

/* PYLONG_BITS_IN_DIGIT describes the number of bits per "digit" (limb) in the
* PyLongObject implementation (longintrepr.h). It's currently either 30 or 15,
* defaulting to 30. The 15-bit digit option may be removed in the future.
*/
#ifndef PYLONG_BITS_IN_DIGIT
#define PYLONG_BITS_IN_DIGIT 30
#endif

because
typedef Py_ssize_t Py_hash_t;

and thus PyHASH_BITS is 31
#if SIZEOF_VOID_P >= 8
# define PyHASH_BITS 61
#else
# define PyHASH_BITS 31
#endif
#define PyHASH_MODULUS (((size_t)1 << _PyHASH_BITS) - 1)

I suggest to add more tests to test_long_hash around these corner cases for such build configurations:

>>> hash(-2**31 - 2)
-3
>>> hash(-2**31 - 1)
-2
>>> hash(-2**31)
-2
>>> hash(2**31)
1
>>> hash(2**31 + 1)
2
>>> hash(2**31 + 2)
3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check that hash(-2**31) is not -1. For the others I has a bit hesitant as the values are implementation details (document as implementation details here: https://docs.python.org/3/library/stdtypes.html#hashing-of-numeric-types). To add tests I would have to get the (private) value of PyHASH_MODULUS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, your link also mentions that PyHASH_MODULUS is available via https://docs.python.org/3/library/sys.html#sys.hash_info.modulus, which is e.g. used in

_PyHASH_MODULUS = sys.hash_info.modulus

and some other tests. But yeah, seems to be an implementation detail. Don't know for sure whether it's ok to use it in tests - but I think so?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyHASH_MODULUS is not private anymore. And it was documented in the algorithm description before.

@@ -1693,5 +1693,11 @@ class MyInt(int):
# GH-117195 -- This shouldn't crash
object.__sizeof__(1)

def test_long_hash(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on introducing hash tests in test_long.py even though there are

def test_hash(self):

and Lib/test/test_hash.py

There are special tests with respect to hashing in many other test modules, e.g.

def test_hash(self):

so IMHO this is a good fit 👍

@rhettinger rhettinger requested a review from tim-one July 14, 2025 03:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants