Suppose I need to count collisions for various hashing schemes.
The input sequence contains 10^8 elements or more.
I don't know how to compute this analytically, so I'm using brute force.
Obviously, the full list of hashes won't fit in RAM.
What is the best way to write code for this? Should I dump the hashes into a database or something? Which libraries are preferred?
I'd suggest keeping a set of files, each named with a prefix of the hashes it contains (for example, with a prefix length of 6, the file named
ffa23b.txt might contain hashes like
ffa23b712f3, et cetera). Each time you read a hash, append it to the file whose name matches the first N characters of the hash. Once all hashes are written, each individual file is small enough to check for duplicates in memory.
You can also use a Bloom filter to quickly rule out a large fraction of the hashes as unique, without having to store all of the hashes in memory. That way, you only have to fall back to searching through a given prefix file if the check against the Bloom filter says that you might have seen the hash before - something that will happen rarely.
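A toy Bloom filter to illustrate the idea (a minimal sketch; a real deployment would size `size_bits` and `num_hashes` from the expected element count and acceptable false-positive rate, or use an existing library):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=10_000_000, num_hashes=7):
        self.m = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of a SHA-256 digest
        # (the double-hashing trick: position_i = h1 + i * h2 mod m)
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means definitely never added; True means possibly added
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))
```

The workflow would be: for each incoming hash, check `might_contain`; if False, it is definitely new, so just `add` it and move on; if True, search the corresponding prefix file to confirm whether it is a real collision or a Bloom false positive.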