soupault - 3 months ago
Python Question

Way to process very large lists of hashes

Suppose I need to count collisions for various hashing schemes.
The number of elements in the input sequence is on the order of 10^8 or more.
I don't know how to compute this analytically, so I'm using brute force.

It's obviously infeasible to keep this list of hashes in RAM.
What is the best way to write code for my needs? Should I dump the hashes into a database or something similar? Which libraries are preferred?

Thank you!

Answer

I'd suggest keeping a set of files, each one named with a prefix of the hashes contained within it (for example, with a prefix length of 6, the file named ffa23b.txt might contain the hashes ffa23b11d4334, ffa23b712f3, et cetera). Each time you read a hash, you append it to the file whose name matches the first N characters of that hash. This way each bucket file stays small enough to check for duplicates in memory, even when the full set of hashes does not fit.
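A minimal sketch of that bucketing scheme in Python (the PREFIX_LEN value, the hash_buckets directory, and the function names are my own illustrative placeholders, not anything prescribed above):

```python
import os

# Illustrative choices: a 6-character prefix and a local bucket directory.
PREFIX_LEN = 6
OUT_DIR = "hash_buckets"
os.makedirs(OUT_DIR, exist_ok=True)

def append_hash(hash_hex: str) -> None:
    """Append a hash to the bucket file named after its first
    PREFIX_LEN characters (e.g. ffa23b... goes into ffa23b.txt)."""
    bucket = os.path.join(OUT_DIR, hash_hex[:PREFIX_LEN] + ".txt")
    with open(bucket, "a") as f:
        f.write(hash_hex + "\n")

def count_collisions_in_bucket(path: str) -> int:
    """Each bucket is small enough to load whole, even when the full
    set of hashes is not; duplicate hashes within a bucket are the
    collisions we want to count."""
    with open(path) as f:
        hashes = f.read().split()
    return len(hashes) - len(set(hashes))

def total_collisions() -> int:
    # Equal hashes necessarily share a prefix and so land in the same
    # bucket, which is why summing per-bucket counts gives the total.
    return sum(
        count_collisions_in_bucket(os.path.join(OUT_DIR, name))
        for name in os.listdir(OUT_DIR)
    )
```

Note that opening a bucket file once per hash is slow at this scale; in practice you would buffer hashes in memory and flush them to the bucket files in batches.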

You can also use a Bloom filter to quickly rule out a large fraction of the hashes as unique, without having to store all of the hashes in memory. That way, you only have to fall back to searching through a given prefix file when the check against the Bloom filter says that you might have seen the hash before, something that will happen rarely.
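Here is a rough, self-contained sketch of that pattern, assuming the hashes arrive as hex strings. The filter sizing (bit count and number of probes) is a placeholder you would tune for your element count and target false-positive rate, and it uses plain hashlib rather than a dedicated Bloom-filter library:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over strings, backed by a bytearray bitset."""

    def __init__(self, num_bits: int = 1 << 30, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)  # 2^30 bits = 128 MiB

    def _positions(self, item: str):
        # Derive several bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False => definitely never added; True => possibly added before.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bloom = BloomFilter()

def saw_before(hash_hex: str) -> bool:
    """Check-then-add. A True result still needs verifying against the
    matching prefix file, because Bloom filters can return false
    positives (though never false negatives)."""
    possible = bloom.might_contain(hash_hex)
    bloom.add(hash_hex)
    return possible
```

With roughly 10^9 bits and 7 probes, a filter like this should keep the false-positive rate around a percent or less for 10^8 elements, so the expensive prefix-file lookup happens only for genuine duplicates and the occasional false positive.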