wim wim - 1 year ago 45
Python Question

Is this an appropriate use of python's built-in hash function?

I need to compare large chunks of data for equality, and I need to compare many per second, fast. Every object is guaranteed to be the same size, and it is possible/likely they may only be slightly different (in unknown positions).

I have seen, from the interactive session below, using

operator for byte strings can be slower if the differences are towards the end of the string, and it can be very fast if there is a difference near the start.

I thought there might be some way to speed things up using some sort of hash, of course computing the md5 hash and comparing is a fair whack slower, but python's inbuilt hash does seem to speed things up significantly.

However, I have no idea about the implementation details of this hash, is it really hash-like in that I can be comfortable that when
hash(a) == hash(b)
a == b
is very likely? I am happy to have a few incorrect results if a hash collision is reasonably rare (rare in the sense of needing an array of 200 PS3s several hours to make a collision)

In [1]: import hashlib

In [2]: with open('/dev/urandom') as f:
...: spam = f.read(2**20 - 1)

In [3]: spamA = spam + 'A'

In [4]: Aspam = 'A' + spam

In [5]: spamB = spam + 'B'

In [6]: timeit spamA == spamB
1000 loops, best of 3: 1.59 ms per loop

In [7]: timeit spamA == Aspam
10000000 loops, best of 3: 66.4 ns per loop

In [8]: timeit hashlib.md5(spamA) == hashlib.md5(spamB)
100 loops, best of 3: 4.42 ms per loop

In [9]: timeit hashlib.md5(spamA) == hashlib.md5(Aspam)
100 loops, best of 3: 4.39 ms per loop

In [10]: timeit hash(spamA) == hash(spamB)
10000000 loops, best of 3: 157 ns per loop

In [11]: timeit hash(spamA) == hash(Aspam)
10000000 loops, best of 3: 160 ns per loop

Answer Source

Python's hash function is designed for speed, and maps into a 64-bit space. Due to the birthday paradox, this means you'll likely get a collision at about 5 billion entries (probably way earlier, since the hash function is not cryptographical). Also, the precise definition of hash is up to the Python implementation, and may be architecture- or even machine-specific. Don't use it you want the same result on multiple machines.

md5 is designed as a cryptographic hash function; even slight perturbations in the input totally change the output. It also maps into a 128-bit space, which makes it unlikely you'll ever encounter a collision at all unless you're specifically looking for one.

If you can handle collisions (i.e. test for equality between all members in a bucket, possibly by using a cryptographic algorithm like MD5 or SHA2), Python's hash function is perfectly fine.

One more thing: To save space, you should store the data in binary form if you write it to disk. (i.e. struct.pack('!q', hash('abc')) / hashlib.md5('abc').digest()).

As a side note: is is not equivalent to == in Python. You mean ==.