I need to compare large chunks of data for equality, and I need to compare many per second, fast. Every object is guaranteed to be the same size, and it is possible/likely they may only be slightly different (in unknown positions).
I have seen, from the interactive session below, that using
hash(a) == hash(b)
is much faster than
a == b
Is it safe to rely on the hash comparison?
In : import hashlib
In : with open('/dev/urandom') as f:
...: spam = f.read(2**20 - 1)
In : spamA = spam + 'A'
In : Aspam = 'A' + spam
In : spamB = spam + 'B'
In : timeit spamA == spamB
1000 loops, best of 3: 1.59 ms per loop
In : timeit spamA == Aspam
10000000 loops, best of 3: 66.4 ns per loop
In : timeit hashlib.md5(spamA) == hashlib.md5(spamB)
100 loops, best of 3: 4.42 ms per loop
In : timeit hashlib.md5(spamA) == hashlib.md5(Aspam)
100 loops, best of 3: 4.39 ms per loop
In : timeit hash(spamA) == hash(spamB)
10000000 loops, best of 3: 157 ns per loop
In : timeit hash(spamA) == hash(Aspam)
10000000 loops, best of 3: 160 ns per loop
Python's hash function is designed for speed, and maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably much earlier, since the hash function is not cryptographic). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific. Don't use it if you want the same result on multiple machines.
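To see the machine-dependence concretely: in Python 3, string hashing is also randomized per interpreter process (controlled by the PYTHONHASHSEED environment variable), so the same string hashes differently across runs. A small sketch (the helper name is mine, not from the original post):

```python
import os
import subprocess
import sys

def hash_in_fresh_interpreter(seed):
    # Hash the same string in a new interpreter with a fixed hash seed.
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('spam'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# Different seeds give different hashes for the identical string:
print(hash_in_fresh_interpreter(1))
print(hash_in_fresh_interpreter(2))
```

Within a single process the value is stable, which is all that matters for in-memory comparisons; it only bites you if you persist hashes or compare them across machines.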
md5 is designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one.
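For example, two inputs differing only in their last byte produce completely unrelated digests (a quick Python 3 sketch; the input bytes are arbitrary):

```python
import hashlib

# Two inputs that differ only in the final byte:
a = hashlib.md5(b"spam" * 1000 + b"A").hexdigest()
b = hashlib.md5(b"spam" * 1000 + b"B").hexdigest()

print(a)
print(b)
print(a == b)  # the digests share no visible structure
```

Note that if you do use hashlib, compare the digests (.digest() or .hexdigest()), not the hash objects themselves — the objects don't define equality, so == on them is just identity comparison.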
If you can handle collisions (i.e. test for equality between all members in a bucket, possibly by using a cryptographic algorithm like MD5 or SHA2), Python's hash function is perfectly fine.
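A minimal sketch of that bucket-and-verify pattern (names are illustrative, not from the original post): hash() prunes almost all comparisons, and the expensive == runs only inside a bucket of hash-equal candidates.

```python
from collections import defaultdict

def find_duplicates(chunks):
    """Group chunks by hash(), then confirm equality with == inside each bucket."""
    buckets = defaultdict(list)
    for chunk in chunks:
        buckets[hash(chunk)].append(chunk)

    duplicates = []
    for same_hash in buckets.values():
        # Equal hashes do not guarantee equal data: verify with ==.
        seen = []
        for chunk in same_hash:
            for other in seen:
                if chunk == other:
                    duplicates.append(chunk)
                    break
            else:
                seen.append(chunk)
    return duplicates

print(find_duplicates([b"aa", b"bb", b"aa", b"cc"]))  # prints [b'aa']
```

Since your chunks are likely to differ, most buckets will hold a single element and the full comparison almost never runs.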
One more thing: To save space, you should store the hash in binary form if you write it to disk, e.g.
struct.pack('!q', hash('abc'))
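That pack call fits the hash into exactly 8 bytes (a signed big-endian 64-bit integer), versus up to 20 characters for the decimal text form:

```python
import struct

h = hash('abc')
packed = struct.pack('!q', h)  # '!' = network byte order, 'q' = signed 64-bit

print(len(packed))                       # prints 8
print(struct.unpack('!q', packed)[0] == h)  # round-trips losslessly, prints True
```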
As a side note: is is not equivalent to == in Python. You mean ==.