wim wim - 4 months ago 11
Python Question

Is this an appropriate use of python's built-in hash function?

I need to compare large chunks of data for equality, and I need to compare many per second, fast. Every object is guaranteed to be the same size, and it is possible/likely they may only be slightly different (in unknown positions).

I have seen, from the interactive session below, using

==
operator for byte strings can be slower if the differences are towards the end of the string, and it can be very fast if there is a difference near the start.

I thought there might be some way to speed things up using some sort of hash, of course computing the md5 hash and comparing is a fair whack slower, but python's inbuilt hash does seem to speed things up significantly.

However, I have no idea about the implementation details of this hash, is it really hash-like in that I can be comfortable that when
hash(a) == hash(b)
then
a == b
is very likely? I am happy to have a few incorrect results if a hash collision is reasonably rare (rare in the sense of needing an array of 200 PS3s several hours to make a collision)

In [1]: import hashlib

In [2]: with open('/dev/urandom') as f:
...: spam = f.read(2**20 - 1)
...:

In [3]: spamA = spam + 'A'

In [4]: Aspam = 'A' + spam

In [5]: spamB = spam + 'B'

In [6]: timeit spamA == spamB
1000 loops, best of 3: 1.59 ms per loop

In [7]: timeit spamA == Aspam
10000000 loops, best of 3: 66.4 ns per loop

In [8]: timeit hashlib.md5(spamA) == hashlib.md5(spamB)
100 loops, best of 3: 4.42 ms per loop

In [9]: timeit hashlib.md5(spamA) == hashlib.md5(Aspam)
100 loops, best of 3: 4.39 ms per loop

In [10]: timeit hash(spamA) == hash(spamB)
10000000 loops, best of 3: 157 ns per loop

In [11]: timeit hash(spamA) == hash(Aspam)
10000000 loops, best of 3: 160 ns per loop

Answer

Python's hash function is designed for speed, and maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographical). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific. Don't use it you want the same result on multiple machines.

md5 is designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one.

If you can handle collisions (i.e. test for equality between all members in a bucket, possibly by using a cryptographic algorithm like MD5 or SHA2), Python's hash function is perfectly fine.

One more thing: To save space, you should store the data in binary form if you write it to disk. (i.e. struct.pack('!q', hash('abc')) / hashlib.md5('abc').digest()).

As a side note: is is not equivalent to == in Python. You mean ==.