Joris Joris - 1 year ago 51
Python Question

Comparing content in two csv files

So I have two csv files.

One has more data than the other, so I want to pull out the rows in the larger file that do not occur in the smaller one.
Here's what I have so far:

with open('Book1.csv', 'rb') as csvMasterForDiff:
    with open('similarities.csv', 'rb') as csvSlaveForDiff:
        masterReaderDiff = csv.reader(csvMasterForDiff)
        slaveReaderDiff = csv.reader(csvSlaveForDiff)

        testNotInCount = 0
        testInCount = 0
        for row in masterReaderDiff:
            if row not in slaveReaderDiff:
                testNotInCount = testNotInCount + 1
            else:
                testInCount = testInCount + 1

        print('Not in file: '+ str(testNotInCount))
        print('Exists in file: '+ str(testInCount))

However, the results are

Not in file: 2093
Exists in file: 0

I know this is incorrect because at least the first 16 entries in one file do not exist in the other, but not all of them are missing. What am I doing wrong?

Answer Source

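A csv.reader object is an iterator, which means you can only iterate through it once. Each "row not in slaveReaderDiff" check advances the reader while it searches, and once the reader is exhausted every later check finds nothing, so the row is counted as missing. You should be using lists/sets for containment checking, e.g.: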

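# csv.reader yields lists, which are unhashable, so store each row as a tuple
slave_rows = set(tuple(row) for row in slaveReaderDiff)

for row in masterReaderDiff:
    if tuple(row) not in slave_rows:
        testNotInCount += 1
    else:
        testInCount += 1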
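
For completeness, here is a minimal self-contained sketch of the same idea. It assumes Python 3 (so the files are opened in text mode with newline='' rather than 'rb'), and the function name count_row_overlap and its parameters are made up for illustration. Rows are compared field for field, so a stray space or different quoting in one file would make otherwise similar rows count as different.

```python
import csv


def count_row_overlap(master_path, slave_path):
    """Return (not_in_count, in_count) for master rows checked against slave rows.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """
    # Read the slave file once into a set. Rows become tuples because
    # lists are unhashable and cannot be stored in a set.
    with open(slave_path, newline='') as slave_file:
        slave_rows = {tuple(row) for row in csv.reader(slave_file)}

    not_in_count = 0
    in_count = 0
    # Stream the master file; each lookup hits the in-memory set,
    # so the slave reader is never iterated a second time.
    with open(master_path, newline='') as master_file:
        for row in csv.reader(master_file):
            if tuple(row) in slave_rows:
                in_count += 1
            else:
                not_in_count += 1
    return not_in_count, in_count
```

With the file names from the question, you would call count_row_overlap('Book1.csv', 'similarities.csv') and print the two counts it returns. A set is used rather than a list because membership tests on a set do not have to scan every stored row.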