Lukas Lukas - 4 months ago 8
Python Question

Is it really necessary to hash the same for classes that compare the same?

Reading this answer it seems, that if

__eq__
is defined in custom class,
__hash__
needs to be defined as well. This is understandable.

However it is not clear, why - effectively -
__eq__
should be same as
self.__hash__()==other.__hash__


Imagining a class like this:

class Foo:
...
self.Name
self.Value
...
def __eq__(self,other):
return self.Value==other.Value
...
def __hash__(self):
return id(self.Name)


This way class instances could be compared by value, which could be the only reasonable use, but considered identical by name.

This way
set
could not contain multiple instances with equal name, but comparison would still work.

What could be the problem with such definition?

The reason for defining
__eq__
,
__lt__
and other by
Value
is to be able to sort instances by
Value
and to be able to use functions like max. For example, he class should represent a physical output of a device (say heating element). Each of these outputs has unique Name. The Value is power of the output device. To find optimal combination of heating elements to turn on, it is useful to be able to compare them by power (Value). In a set or dictionary, however, it should not be possible to have multiple outputs with same names. Of course, different outputs with different names might easily have equal power.

Answer

The problem is that it does not make sense, hash is used to do efficient bucketing of objects. Consequently, when you have a set, which is implemented as a hash table, each hash points to a bucket, which is usually a list of elements. In order to check if an element is in the set (or other hash based container) you go to the bucket pointed by a hash and then you iterate over all elements in the list, comparing them one by one.

In other words - hash is not supposed to be a comparator (as it can, and should give you sometimes a false positive). In particular, in your example, your set will not work - it will not recognize duplicate, as they do not compare to each other.

class Foo:

    def __eq__(self,other):
        return self.Value==other.Value

    def __hash__(self):
        return id(self.Name)


a = set()
el = Foo()
el.Name = 'x'
el.Value = 1

el2 = Foo()
el2.Name = 'x'
el2.Value = 2

a.add(el)
a.add(el2)
print len(a) # should be 1, right? Well it is 2

actually it is even worse then that, if you have 2 objects with the same values but different names, they are not recognized to be the same either

class Foo:

    def __eq__(self,other):
        return self.Value==other.Value

    def __hash__(self):
        return id(self.Name)


a = set()
el = Foo()
el.Name = 'x'
el.Value = 2

el2 = Foo()
el2.Name = 'a'
el2.Value = 2

a.add(el)
a.add(el2)
print len(a) # should be 1, right? Well it is 2 again

while doing it properly (thus, "if a == b, then hash(a) == hash(b)") gives:

class Foo:

    def __eq__(self,other):
        return self.Name==other.Name

    def __hash__(self):
        return id(self.Name)


a = set()
el = Foo()
el.Name = 'x'
el.Value = 1

el2 = Foo()
el2.Name = 'x'
el2.Value = 2

a.add(el)
a.add(el2)
print len(a) # is really 1

Update

There is also an non deterministic part, which is hard to easily reproduce, but essentially hash does not uniquely define a bucket. Usually it is like

bucket_id = hash(object) % size_of_allocated_memory 

consequently things that have different hashes can still end up in the same bucket. Consequently, you can get two elements equal to each (inside set) due to equality of Values even though Names are different, as well as the other way around, depending on actual internal implementation, memory constraints etc.

In general there are many more examples where things can go wrong, as hash is defined as a function h : X -> Z such that x == y => h(x) == h(y), thus people implementing their containers, authorization protocols, and other tools are free to assume this property. If you break it - every single tool using hashes can break. Furthermore, it can break in time, meaning that you update some library and your code will stop working, as a valid update to the underlying libraries (using the above assumption) can lead to exploiting your violation of this assumption.

Update 2

Finally, in order to solve your issue - you simply should not define your eq, lt operators to handle sorting. It is about actual comparison of the elements, which should be compatible with the rest of the behaviours. All you have to do is define a separate comparator and use it in your sorting routines (sorting in python accepts any comparator, you do not need to rely on <, > etc.). The other way around is to instead have valid <, >, = defined on values, but in order to keep names unique - keep a set with... well... names, and not objects themselves. Whichever path you choose - the crucial element here is: equality and hashing have to be compatible, that's all.

Comments