johnnyb - 2 months ago
Python Question

What is the most efficient way to compare a 45-million-row text file against a text file of about 200k rows and output the non-matching rows from the smaller file?

I have a 45-million-row text file that contains hashes. What is the most efficient way to compare it against a second, smaller file and output only the items from the second file that are not in the large file?

Current working:

comm -13 largefile.txt smallfile.txt >> newfile.txt


This runs pretty fast, but I would like to move it into Python so it works regardless of OS.

Attempted with memory issues:

tp = pd.read_csv(r'./large file.txt', encoding='iso-8859-1', iterator=True, chunksize=50000)
full = pd.concat(tp, ignore_index=True)


This method maxes out my memory and usually crashes before it finishes.

Example:

<large file.txt>
hashes
fffca07998354fc790f0f99a1bbfb241
fffca07998354fc790f0f99a1bbfb242
fffca07998354fc790f0f99a1bbfb243
fffca07998354fc790f0f99a1bbfb244
fffca07998354fc790f0f99a1bbfb245
fffca07998354fc790f0f99a1bbfb246


<smaller file.txt>
hashes
fffca07998354fc790f0f99a1bbfb245
fffca07998354fc790f0f99a1bbfb246
fffca07998354fc790f0f99a1bbfb247
fffca07998354fc790f0f99a1bbfb248
fffca07998354fc790f0f99a1bbfb249
fffca07998354fc790f0f99a1bbfb240


Expected Output

<new file.txt>
hashes
fffca07998354fc790f0f99a1bbfb247
fffca07998354fc790f0f99a1bbfb248
fffca07998354fc790f0f99a1bbfb249
fffca07998354fc790f0f99a1bbfb240

Answer

Hash table. Or in Python terms, just use a set.

Put each item from the smaller file into the set. 200K items is perfectly fine. Then read the larger file one line at a time and check whether each item exists in the set. If there is a match, remove the item from the set.

When you are done, any item remaining in the set represents an item not found in the larger file.

My Python is a little rusty, but it would go something like this:

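s = set()

# Load the smaller file (about 200k lines) into the set.
with open("small_file.txt") as f:
    for line in f:
        s.add(line.strip())

# Stream the larger file line by line so it never sits in memory all at once.
# Strip before the lookup; otherwise the trailing newline prevents any match.
with open("large_file.txt") as f:
    for line in f:
        s.discard(line.strip())

# Whatever is left in the set was not found in the larger file.
for i in s:
    print(i)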
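
One thing to keep in mind: a set does not remember order, and the shared "hashes" header line gets discarded along with every other match. If the output should look like the expected new file in the question (header first, rows in the order of the smaller file), something along these lines might work. This is only a sketch under a few assumptions: the function name write_missing and the has_header flag are made up for illustration, the first line of each file is assumed to be a header, and the iso-8859-1 encoding is simply carried over from the pandas attempt in the question.

```python
def write_missing(small_path, large_path, out_path, has_header=True,
                  encoding="iso-8859-1"):
    """Write the rows of small_path that never occur in large_path.

    Hypothetical helper: keeps the order of the smaller file and copies
    its header line (if any) to the output. Returns the number of rows written.
    """
    # Read the smaller file into a list so its order is preserved.
    with open(small_path, encoding=encoding) as f:
        small = [line.strip() for line in f if line.strip()]

    header = small.pop(0) if has_header and small else None
    remaining = set(small)

    # Stream the larger file; only one line is held in memory at a time.
    with open(large_path, encoding=encoding) as f:
        if has_header:
            next(f, None)  # skip the header row of the larger file
        for line in f:
            remaining.discard(line.strip())
            if not remaining:
                break  # every row was matched, no need to read further

    written = 0
    with open(out_path, "w", encoding=encoding) as out:
        if header is not None:
            out.write(header + "\n")
        for item in small:
            if item in remaining:
                out.write(item + "\n")
                written += 1
    return written
```

Calling write_missing("small_file.txt", "large_file.txt", "newfile.txt") would then write the result to a file instead of printing it. Memory use is driven by the smaller file only, since the larger one is never loaded as a whole.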