xralf - 2 months ago
Python Question

Limits of hashes comparisons

I'm storing hash codes in a file, one hash per line.

When I have a new hash code, I open the file and check if
the hash code already exists and if it doesn't exist,
I save it to that file.

f = open("hashes.txt", "r")
hashes = [line.strip() for line in f.readlines()]
f.close()

hash_code = "ff071fdf1e060400"
if hash_code not in hashes:
    hashes.append(hash_code)
    f = open("hashes.txt", "w")
    for h in hashes:
        f.write(h + "\n")
    f.close()


This code may become slow as the number of lines in hashes.txt grows.
Is there a better storage mechanism and check that stays fast in that
case? I need the fastest possible check (within a few seconds).

Answer

I would do this:

class Hashes(object):
    def __init__(self, filename):
        self.filename = filename
        try:
            with open(filename, 'rt') as f:         # read the file only once
                self.hashes = set(line.strip() for line in f)
        except FileNotFoundError:                   # first run: no file yet
            self.hashes = set()

    def add_hash(self, hash):
        if hash not in self.hashes:                 # this is very fast with sets
            self.hashes.add(hash)
            with open(self.filename, 'at') as f:
                print(hash, file=f)                 # write only one hash

hashes = Hashes("hashes.txt")
hashes.add_hash("ff071fdf1e060400")

Because:

  • the file is read only once
  • checking whether a hash exists is very fast with sets (no need to read all of them)
  • hashes are added by writing only the new hash
  • the class makes it easy to manage multiple hash files and clean up cached hashes, and it simplifies maintenance by organising the code

The downside is that all hashes are kept in memory. If there are many millions of hashes, that could start causing problems, but until then it is fine. It is better than fine if speed is important.
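If the hash count really does grow into the millions, one way to keep them out of RAM is the standard library's sqlite3 module, which does fast membership checks through a primary-key index. This is only a sketch of that alternative, not part of the answer above; the DiskHashes name is mine:

```python
import sqlite3

class DiskHashes(object):
    """Hypothetical on-disk variant of Hashes backed by sqlite3."""

    def __init__(self, filename):
        self.db = sqlite3.connect(filename)
        # The PRIMARY KEY gives an index, so lookups do not scan the table.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY)"
        )

    def add_hash(self, hash):
        # INSERT OR IGNORE is a no-op when the hash already exists,
        # so no separate existence check is needed.
        with self.db:
            self.db.execute(
                "INSERT OR IGNORE INTO hashes VALUES (?)", (hash,)
            )

    def __contains__(self, hash):
        cur = self.db.execute(
            "SELECT 1 FROM hashes WHERE hash = ?", (hash,)
        )
        return cur.fetchone() is not None

hashes = DiskHashes(":memory:")         # use a real path for persistence
hashes.add_hash("ff071fdf1e060400")
hashes.add_hash("ff071fdf1e060400")     # duplicate, silently ignored
```

The trade-off is the opposite of the in-memory set: memory use stays flat no matter how many hashes there are, but each check and insert costs a database query instead of a set lookup.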