Add hash function for AffineScalarFunc class #189

Closed · wants to merge 20 commits into from
Changes from 17 commits
d90c541
Update core.py
MichaelTiemannOSC Jul 5, 2023
5d429fe
Implement hash invariant
MichaelTiemannOSC Jul 7, 2023
40154ce
Fix pickling (broken by last commit)
MichaelTiemannOSC Jul 8, 2023
634db47
Improve efficiency of AffineScalarFunc hash
MichaelTiemannOSC Jul 8, 2023
7cca18c
Update appveyor.yml
MichaelTiemannOSC Jul 11, 2023
af3447c
Revert "Update appveyor.yml"
MichaelTiemannOSC Jul 11, 2023
f3cb615
Replace nose with pytest
MichaelTiemannOSC Jul 11, 2023
5e40c49
Revert "Replace nose with pytest"
MichaelTiemannOSC Jul 11, 2023
cd3b7e0
Update appveyor.yml
MichaelTiemannOSC Jul 11, 2023
a2d4bb1
Update appveyor.yml
MichaelTiemannOSC Jul 11, 2023
d74a9d1
feat: Added hash function for AffineScalarFunc class
NelDav Jan 22, 2024
9a9d6d5
Merge remote-tracking branch 'hash/hash_for_pandas' into implement-ha…
NelDav Jan 25, 2024
3e0b064
fix: Merged hash method from "MichaelTiemannOSC". And created a corre…
NelDav Jan 25, 2024
1e57a2f
refactor: Nose does not like return types. Removed them.
NelDav Jan 25, 2024
e4ef9e1
Merge branch 'master' into implement-hash-for-AffineScalarFunc
NelDav Apr 2, 2024
4d8d268
fix: hash calculation works now. However, hash equality between Varia…
NelDav Apr 4, 2024
605b5cd
fix: Variabel hash is equal for objects which __eq__ returns True.
NelDav Apr 4, 2024
82285df
fix: Pickle is able to serialize/deserialize Variables again
NelDav Apr 5, 2024
3c64ae3
fix: pickling works now
NelDav Apr 5, 2024
185aa0d
fix: derivatives with value 0 are filtered out for hash calculation.
NelDav Apr 6, 2024
40 changes: 31 additions & 9 deletions uncertainties/core.py
@@ -1756,6 +1756,19 @@ def __bool__(self):

########################################

def __hash__(self):
"""
Calculates the hash for any AffineScalarFunc object.
The hash is calculated from the nominal_value, and the derivatives.

Returns:
int: The hash of this object
"""

items = sorted(self.derivatives.items(), key= lambda x: id(x[0]))
ids, values = zip(*map(lambda item: (id(item[0]), item[1]), items))
return hash((self._nominal_value, tuple(ids), tuple(values)))
Collaborator:

Here is an alternative suggestion. It does the same thing (except that it puts the derivatives into a tuple of tuples instead of producing two equal-length tuples) but may be more readable (I guess it depends on whether you prefer comprehensions to map/lambda). In an ideal world, we could just do frozendict(self.derivatives), but there is no frozendict in the standard library.

Suggested change
items = sorted(self.derivatives.items(), key= lambda x: id(x[0]))
ids, values = zip(*map(lambda item: (id(item[0]), item[1]), items))
return hash((self._nominal_value, tuple(ids), tuple(values)))
items = sorted((id(var), val) for var, val in self.derivatives.items())
return hash((self._nominal_value, tuple(items)))
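For illustration, here is a standalone sketch of the suggested approach, using toy stand-ins for `Variable` and `AffineScalarFunc` (the real classes in `uncertainties/core.py` carry much more machinery); it checks the key invariant that a trivial function of a variable hashes like the variable itself:

```python
class Var:
    """Toy stand-in for Variable: hashes on the nominal value and {self: 1.}."""
    def __init__(self, nominal_value):
        # _nominal_value must be set before building the self-referencing
        # derivatives dict, because inserting self as a key hashes self.
        self._nominal_value = nominal_value
        self.derivatives = {self: 1.}

    def __hash__(self):
        return hash((self._nominal_value, ((id(self), 1.),)))


class Func:
    """Toy stand-in for AffineScalarFunc."""
    def __init__(self, nominal_value, derivatives):
        self._nominal_value = nominal_value
        self.derivatives = derivatives

    def __hash__(self):
        # The suggested form: sorted (id, coefficient) pairs as a tuple
        # of tuples, combined with the nominal value.
        items = sorted((id(var), val) for var, val in self.derivatives.items())
        return hash((self._nominal_value, tuple(items)))


a = Var(1.0)
# Behaves like (2*a + a)/3: same nominal value, derivative 1 w.r.t. a.
f = Func(1.0, {a: 1.0})
assert hash(f) == hash(a)
```

The two `__hash__` implementations must stay consistent with each other, which is exactly the point raised further down in this thread.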

Author:

In an ideal world, we could just do frozendict(self.derivatives) but there is no frozendict in the standard library.

There is MappingProxyType, but I do not know if it would work.
https://docs.python.org/3/library/types.html#types.MappingProxyType

Author:

MappingProxyType is not hashable, so it will not work.

I think your solution is more elegant, but for some reason, assert hash((2*a+a)/3)==hash(a) fails when using your proposal.
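As a side note, the non-hashability of MappingProxyType is easy to confirm; this small sketch (independent of the uncertainties internals) shows both the read-only view and the missing hash:

```python
from types import MappingProxyType

d = {'a': 1}
proxy = MappingProxyType(d)

# The proxy is read-only, but the dict underneath can still change,
# so mappingproxy does not support hashing:
try:
    hash(proxy)
    hashable = True
except TypeError:
    hashable = False
assert not hashable

# Mutations of the original dict are visible through the proxy:
d['b'] = 2
assert proxy['b'] == 2
```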

Collaborator:

Did you update __hash__ for both AffineScalarFunc and Variable? I suggested updating one above, but both need to be updated consistently. For Variable, the changed version should look like:

    return hash((self._nominal_value, ((id(self), 1.),)))

Author:

Yes, I made a mistake with the parentheses.
Now it works :)

Collaborator:

One thing I realized -- we moved away from frozenset because duplicate derivative coefficient values could be combined and lead to more hash collisions, but that was before we switched to a collection of tuples instead of two collections with ids and coefficients. Instead of

        derivatives = sorted([(id(key), value) for key, value in self.derivatives.items()])
        return hash((self._nominal_value, tuple(derivatives)))

we could do

        derivatives = frozenset((id(key), value) for key, value in self.derivatives.items())
        return hash((self._nominal_value, derivatives))

since the ids are always unique, no tuples will get combined in the set. Whether that is more or less readable or efficient, I am not sure. I like not sorting when it is not necessary, but that is pretty minor. I was curious whether frozenset.__hash__ used sorting in its implementation, but I checked, and what it does is XOR the hashes of the elements together, which is neat because the XOR operation is also order-independent.
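The order independence of frozenset hashing, and the order sensitivity of tuples that forces the sort in the tuple-based variant, are easy to confirm:

```python
# frozenset.__hash__ XORs the (individually mixed) element hashes, so
# element order never matters and no sorting is needed:
s1 = frozenset([(1, 0.5), (2, 0.25), (3, 1.0)])
s2 = frozenset([(3, 1.0), (1, 0.5), (2, 0.25)])
assert s1 == s2 and hash(s1) == hash(s2)

# Tuples, by contrast, are order-sensitive, which is why the tuple-based
# variant must sort the (id, coefficient) pairs before hashing:
items = [(3, 1.0), (1, 0.5), (2, 0.25)]
assert hash(tuple(sorted(items))) == hash(tuple(sorted(reversed(items))))
```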


# Uncertainties handling:

def error_components(self):
@@ -2729,7 +2742,7 @@ def __init__(self, value, std_dev, tag=None):
# differentiable functions: for instance, Variable(3, 0.1)/2
# has a nominal value of 3/2 = 1, but a "shifted" value
# of 3.1/2 = 1.55.
value = float(value)
self._nominal_value = float(value)

# If the variable changes by dx, then the value of the affine
# function that gives its value changes by 1*dx:
@@ -2739,7 +2752,7 @@
# takes much more memory. Thus, this implementation chooses
# more cycles and a smaller memory footprint instead of no
# cycles and a larger memory footprint.
super(Variable, self).__init__(value, LinearCombination({self: 1.}))
super(Variable, self).__init__(self._nominal_value, LinearCombination({self: 1.}))

self.std_dev = std_dev # Assignment through a Python property

@@ -2766,6 +2779,22 @@ def std_dev(self, std_dev):

self._std_dev = float(std_dev)

def __hash__(self):
"""
Calculates the hash for any Variable object.
This simply calls the __hash__ method of `AffineScalarFunc` in most cases.
During initialization, the `_linear_part` and `_nominal_value` attributes do not exist yet.
Because of that, `_nominal_value` is set to `None` if this method is called during initialization.

Returns:
int: The hash of this object
"""
# Otherwise, pickle.loads does not work
if not hasattr(self, "_nominal_value"):
self._nominal_value = None
Collaborator:

I think the issue here is that currently pickle.loads needs to reconstruct the LinearCombination stored in the _linear_part attribute in order to construct an AffineScalarFunc. For Variable, the LinearCombination holds the Variable in a dictionary and so needs to hash it before __setstate__ can fully load it.

I think it would be better to break this cycle by overriding Variable.__getstate__ and Variable.__setstate__ not to serialize _linear_part, and instead serialize the other attributes and restore them before setting the linear part, so that it can hash properly.

With the current implementation, None will get used in the hash instead of _nominal_value when loading from pickle, which is not right. I am not sure whether there is a practical case in which this matters, but here is an edge case to demonstrate the problem:

    y = ufloat(1, 1)
    new_y = pickle.loads(pickle.dumps(y))
    assert y in y.derivatives  # Passes
    assert new_y in new_y.derivatives  # Raises KeyError

I am not sure this is worth adding to test_pickling. Maybe if we think of a more practical example we could add it.

Author:

I think the problem is that pickle already generates the _linear_part object before calling __setstate__.
Since the _linear_part always refers to the Variable itself, it cannot calculate the hash properly.

I think we can prevent that by deleting the self-reference inside the __getstate__ method. However, the derivative with respect to self is not necessarily one, right? That means we have to serialize the derivative in some way.

Author:

I tried some stuff to handle this in the __getstate__ and __setstate__ methods.
But one single unit test is failing:

    # Corner case test: when an attribute is present both in __slots__
    # and in __dict__, it is first looked up from the slots
    # (references:
    # http://docs.python.org/2/reference/datamodel.html#invoking-descriptors,
    # http://stackoverflow.com/a/15139208/42973). As a consequence,
    # the pickling process must pickle the correct value (i.e., not
    # the value from __dict__):
    x = NewVariable_dict(3, 0.14)
    x._nominal_value = 'in slots'
    # Corner case: __dict__ key which is also a slot name (it is
    # shadowed by the corresponding slot, so this is very unusual,
    # though):
    x.__dict__['_nominal_value'] = 'in dict'
    # Additional __dict__ attribute:
    x.dict_attr = 'dict attribute'

    x_unpickled = pickle.loads(pickle.dumps(x))

I do not completely understand what this test is for.
I think the problem might be the __getstate__ method.
I will check this in more detail on Monday.

Collaborator:

I think the cleanest option would be to override __getstate__ and __setstate__ on Variable, leaving __getstate__ and __setstate__ as they are on AffineScalarFunc. For Variable, the linear part is always {self: 1.}, so I would just remove the linear part in __getstate__ and set {self: 1.} back in __setstate__. In that way, no self-referencing LinearCombination needs to be pickled. Implementation-wise there are different ways to do this. Maybe the easiest is to call super().__getstate__() and remove the _linear_part and in __setstate__() add the _linear_part to the data dict and then call super().__setstate__().
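A minimal, self-contained sketch of that idea, using toy stand-ins rather than the real uncertainties classes (attribute names such as `_linear_part` follow the discussion above):

```python
import pickle


class LinearCombination:
    """Toy stand-in: maps variables to derivative coefficients."""
    def __init__(self, linear_combo):
        self.linear_combo = linear_combo


class AffineScalarFunc:
    def __init__(self, nominal_value, linear_part):
        self._nominal_value = nominal_value
        self._linear_part = linear_part

    def __getstate__(self):
        return {'_nominal_value': self._nominal_value,
                '_linear_part': self._linear_part}

    def __setstate__(self, data_dict):
        for name, value in data_dict.items():
            setattr(self, name, value)


class Variable(AffineScalarFunc):
    def __init__(self, value):
        # Set the nominal value first: building {self: 1.} hashes self.
        self._nominal_value = float(value)
        super().__init__(self._nominal_value, LinearCombination({self: 1.}))

    def __hash__(self):
        return hash((self._nominal_value, ((id(self), 1.),)))

    def __getstate__(self):
        # Drop the self-referencing linear part: pickle then never has to
        # hash a half-restored Variable while rebuilding a dict keyed on it.
        state = super().__getstate__()
        del state['_linear_part']
        return state

    def __setstate__(self, data_dict):
        # Restore the plain attributes first, then rebuild the trivial
        # {self: 1.} linear part; hashing self now sees a complete object.
        super().__setstate__(data_dict)
        self._linear_part = LinearCombination({self: 1.})


v = Variable(1.0)
new_v = pickle.loads(pickle.dumps(v))
assert new_v._nominal_value == 1.0
assert new_v in new_v._linear_part.linear_combo  # self-reference restored
```

Since pickle reconstructs the instance via `__new__` (bypassing `__init__`) and then calls `__setstate__`, no self-referencing dictionary ever needs to be serialized or hashed before the nominal value is in place.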

Author:

That would only work if it is guaranteed that the _linear_part of AffineScalarFunc does not contain a self-reference. Is that the case?

Collaborator:

I think so. It is only Variable that has the self-reference, since a LinearCombination only holds Variables; a more complicated AffineScalarFunc instance would never put itself into its own LinearCombination.

Author:

I implemented the override of __setstate__ and __getstate__ for Variable and it seems to work. All test cases are passing on my local machine.


return hash((self._nominal_value, (id(self),), (1.,)))

# The following method is overridden so that we can represent the tag:
def __repr__(self):

@@ -2776,13 +2805,6 @@ def __repr__(self):
else:
return "< %s = %s >" % (self.tag, num_repr)

def __hash__(self):
# All Variable objects are by definition independent
# variables, so they never compare equal; therefore, their
# id() are allowed to differ
# (http://docs.python.org/reference/datamodel.html#object.__hash__):
return id(self)

def __copy__(self):
"""
Hook for the standard copy module.
39 changes: 38 additions & 1 deletion uncertainties/test_uncertainties.py
@@ -30,7 +30,7 @@
# Local modules

import uncertainties.core as uncert_core
from uncertainties.core import ufloat, AffineScalarFunc, ufloat_fromstr
from uncertainties.core import ufloat, AffineScalarFunc, ufloat_fromstr, LinearCombination
from uncertainties import umath

# The following information is useful for making sure that the right
@@ -2338,3 +2338,40 @@ def test_correlated_values_correlation_mat():
assert arrays_close(
numpy.array(cov_mat),
numpy.array(uncert_core.covariance_matrix([x2, y2, z2])))

def test_hash():
'''
Tests the invariant that if x == y, then hash(x) == hash(y)
'''

a = ufloat(1.23, 2.34)
b = ufloat(1.23, 2.34)

# nominal values and std_dev terms are equal, but...
assert a.n==b.n and a.s==b.s
# ...a and b are independent variables, therefore not equal as uncertain numbers
assert a != b
assert hash(a) != hash(b)

# order of calculation should be irrelevant
assert a + b == b + a
assert hash(a + b) == hash(b + a)

# the expression (2a+a)/3 is equal to the variable a, so...
assert ((2*a+a)/3)==a
# ...hash of the equation and the variable should be equal
assert hash((2*a+a)/3)==hash(a)
Comment on lines +2361 to +2363
Collaborator:

This one is simple enough that it works out okay, but if the calculation got more complicated than (2. + 1.) / 3 == 1. the equality check could fail due to floating point rounding error. Luckily though the equality check would fail along with the hash, so it would not be a violation of the model.

Author:

Yes, I think you are right.
I already wondered why the equality check is implemented in such a strange way, but I did not look into the details. Probably there is a good reason for it.


c = ufloat(1.23, 2.34)

# the values of the linear combination entries matter
x = AffineScalarFunc(1, LinearCombination({a:1, b:2, c:1}))
y = AffineScalarFunc(1, LinearCombination({a:1, b:2, c:2}))
assert x != y
assert hash(x) != hash(y)

# the order of linear combination values matters and should not lead to the same hash
x = AffineScalarFunc(1, LinearCombination({a:1, b:2}))
y = AffineScalarFunc(1, LinearCombination({a:2, b:1}))
assert x != y
assert hash(x) != hash(y)