
Python 3: What is the fastest way to compare two big files?

In "Big_file.txt", I want to extract UIDs of "User A" which is not duplicated with UIDs in "Small_file.txt". I wrote the following code but it seems it will never stop running. So, how to speed up the process ? Thank you very much :)

import json

uid_available = []
linesB = []
for line in open('E:/Small_file.txt'):
    line = json.loads(line)
    linesB.append(hash(line['uid']))


for line in open('E:/Big_file.txt'):
    line = json.loads(line)
    if hash(line['uid']) not in linesB and line['user'] == 'User A':
        uid_available.append(line['uid'])


This is the format of Big_file.txt (it has 10 million lines):

{"uid": 111, "user": "User A"}
{"uid": 222, "user": "User A"}
{"uid": 333, "user": "User A"}
{"uid": 444, "user": "User B"}
{"uid": 555, "user": "User C"}
{"uid": 666, "user": "User C"}


This is the format of Small_file.txt (it has a few million lines):

{"uid": 333, "user": "User B"}
{"uid": 444, "user": "User B"}
{"uid": 555, "user": "User C"}


The output I expect:

111
222

Answer

Looking up an item in a list takes O(n) time, so each membership test (the "not in linesB" check) scans the whole list; with a few million entries in linesB and 10 million lines in the big file, that is far too slow. If you use a dict or a set instead, each lookup takes O(1) time on average.

The shortest modification you can make is:

linesB = []
for line in open('E:/Small_file.txt'):
    line = json.loads(line)
    linesB.append(hash(line['uid']))
linesB = set(linesB)
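
This keeps the loading loop unchanged and converts the list to a set once at the end, so the millions of membership tests in the second loop drop from O(n) to O(1) on average.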

Or, to do it properly from the start:

linesB = set()
for line in open('E:/Small_file.txt'):
    line = json.loads(line)
    linesB.add(hash(line['uid']))
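
For completeness, here is a minimal sketch of the whole job rewritten around a set (the file paths and the "User A" filter are taken from the question; it assumes each line is a valid JSON object with double-quoted keys). Note that the explicit hash() calls are unnecessary: a set hashes its elements internally, so you can store the UIDs themselves.

import json

# Load the UIDs from the small file into a set; membership tests on a
# set take O(1) time on average, versus O(n) for a list.
with open('E:/Small_file.txt') as small_file:
    seen_uids = {json.loads(line)['uid'] for line in small_file}

# Collect the UIDs of "User A" lines in the big file that were not
# seen in the small file.
uid_available = []
with open('E:/Big_file.txt') as big_file:
    for line in big_file:
        record = json.loads(line)
        if record['user'] == 'User A' and record['uid'] not in seen_uids:
            uid_available.append(record['uid'])

print(uid_available)   # for the sample data above: [111, 222]

Checking record['user'] first also skips the set lookup entirely for lines that are not from User A, though with a set both checks are cheap.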