Joris Joris - 3 months ago 10
Python Question

Comparing content in two csv files

So I have two csv files. Book1.csv has more data than similarities.csv, so I want to pull out the rows in Book1.csv that do not occur in similarities.csv. Here's what I have so far:

with open('Book1.csv', 'rb') as csvMasterForDiff:
    with open('similarities.csv', 'rb') as csvSlaveForDiff:
        masterReaderDiff = csv.reader(csvMasterForDiff)
        slaveReaderDiff = csv.reader(csvSlaveForDiff)

        testNotInCount = 0
        testInCount = 0
        for row in masterReaderDiff:
            if row not in slaveReaderDiff:
                testNotInCount = testNotInCount + 1
            else:
                testInCount = testInCount + 1


print('Not in file: '+ str(testNotInCount))
print('Exists in file: '+ str(testInCount))


However, the results are

Not in file: 2093
Exists in file: 0


I know this is incorrect because, while at least the first 16 entries in Book1.csv do not exist in similarities.csv, not all of them are missing. What am I doing wrong?

Answer

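A csv.reader object is an iterator, which means you can only iterate through it once. Because the first row of Book1.csv has no match, the first "row not in slaveReaderDiff" test reads similarities.csv all the way to the end. After that the reader is exhausted, so every later row is compared against nothing and counted as missing. You should be using lists/sets for containment checking. Since csv.reader yields each row as a list, and lists are not hashable, convert the rows to tuples before putting them in a set, e.g.:

# csv.reader yields lists, which are unhashable, so store rows as tuples
slave_rows = set(tuple(row) for row in slaveReaderDiff)

for row in masterReaderDiff:
    if tuple(row) not in slave_rows:
        testNotInCount += 1
    else:
        testInCount += 1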
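
Since the question asks for the missing rows themselves rather than just a count, here is a minimal self-contained sketch of the same idea. It assumes Python 3 (text mode with newline='' instead of the 'rb' mode used in the question), and the function name rows_missing_from is a hypothetical helper, not something from the original code.

```python
import csv


def rows_missing_from(master_path, other_path):
    """Return the rows of master_path that do not appear in other_path.

    Hypothetical helper for illustration; assumes Python 3.
    """
    # Read the smaller file once and keep its rows in a set.
    # Rows come back as lists, which are unhashable, so store tuples.
    with open(other_path, newline='') as other_file:
        other_rows = {tuple(row) for row in csv.reader(other_file)}

    # Stream the larger file and keep every row not seen in the set.
    with open(master_path, newline='') as master_file:
        return [row for row in csv.reader(master_file)
                if tuple(row) not in other_rows]
```

Calling rows_missing_from('Book1.csv', 'similarities.csv') would then give the list of rows to pull out, and its length is the "Not in file" count. Note that the comparison is exact, so rows that differ only in whitespace or column order are treated as different.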