I have a 45-million-row txt file that contains hashes. What would be the most efficient way to compare it against a second file and output only the items from the second file that are not in the large txt file?
comm -13 largefile.txt smallfile.txt >> newfile.txt

Note that comm expects both inputs to be sorted; run each file through sort first if it isn't.
import pandas as pd

# Read the large file lazily in 50,000-row chunks, then stitch the chunks together.
tp = pd.read_csv(r'./large file.txt', encoding='iso-8859-1', iterator=True, chunksize=50000)
full = pd.concat(tp, ignore_index=True)
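You may not need to materialize all 45 million rows at once, though. A minimal sketch of doing the whole comparison chunk by chunk, assuming one hash per line in both files (the filenames follow the other answer, and the column name 'hash' is just a label introduced here):

import pandas as pd

# Load the smaller file once; track which of its hashes show up in the large file.
small = pd.read_csv('small_file.txt', header=None, names=['hash'], encoding='iso-8859-1')
seen = pd.Series(False, index=small.index)

# Stream the large file in chunks so it never has to fit in memory all at once.
for chunk in pd.read_csv('large_file.txt', header=None, names=['hash'],
                         encoding='iso-8859-1', chunksize=50000):
    seen |= small['hash'].isin(chunk['hash'])

# Whatever was never seen in the large file is the answer.
small.loc[~seen, 'hash'].to_csv('newfile.txt', index=False, header=False)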
Hash table. Or in Python terms, just use a set.
Put each item from the smaller file into the set; 200K items is perfectly fine. Then enumerate each item in the larger file and check whether it exists in the set. If there is a match, remove the item from the set.
When you are done, any item remaining in the set represents an item not found in the larger file.
My Python is a little rusty, but it would go something like this:
s = set()
with open("small_file.txt") as f:
    for line in f:
        s.add(line.strip())  # store each hash from the small file, minus the newline

with open("large_file.txt") as f:
    for line in f:
        s.discard(line.strip())  # drop any hash that also appears in the large file

for item in s:
    print(item)  # whatever is left was never seen in the large file
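If you want the leftovers in a file rather than on stdout (mirroring the newfile.txt from the comm answer), a small addition at the end would do it:

with open("newfile.txt", "w") as out:
    for item in s:
        out.write(item + "\n")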