Blakester - 12 days ago
Python Question

Iterating over a csv file given a specific range

I'm iterating over a fairly large CSV file. startDate and endDate are supplied by the user, and I only need to search rows whose year falls in that range.

However, when I run the program up to that point, it takes a long time and then just prints "set()". I've marked the line where I'm having trouble in the code below.

I'm looking for suggestions and, if possible, sample code. Thank you all in advance!

    def compare(word1, word2, startDate, endDate):
        with open('all_words.csv') as allWords:
            readWords = csv.reader(allWords, delimiter=',')
            year = set()
            for row in readWords:
                if row[1] in range(int(startDate), int(endDate)):  #< Having trouble here
                    if row[0] == word1:
                        year.add(row[1])
            print(year)

Answer

The reason your test isn't finding any years is that the expression:

row[1] in range(int(startDate), int(endDate))

is checking whether a string value appears in a sequence of integers. csv.reader yields every field as a string, so row[1] is something like "1970", not 1970. If you test:

"1970" in range(1960, 1980)

you will see that it returns False. You need to write:

int(row[1]) in range(int(startDate), int(endDate))
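
As a quick check of the point above, here is a minimal snippet (Python 3; the variable names are made up for illustration) that runs the same membership test with and without the int() conversion:

```python
# A str never compares equal to an int, so the first test cannot succeed.
year_text = "1970"  # csv.reader hands back every field as a string

string_hit = year_text in range(1960, 1980)
int_hit = int(year_text) in range(1960, 1980)

print(string_hit)  # False
print(int_hit)     # True
```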

However, this is still quite inefficient. It checks whether the value int(row[1]) occurs anywhere in the sequence int(startDate), int(startDate)+1, ..., int(endDate)-1, and it does so by linear search. A direct comparison is much faster:

if int(startDate) <= int(row[1]) < int(endDate):

Note that your code was written to exclude endDate from the set of possible dates (because range excludes its second argument), and I've done the same in the comparison above.
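
Putting the pieces together, the whole function might look roughly like the sketch below. It is only one possible arrangement, not your exact program: the name years_for_word is hypothetical, the unused word2 parameter is dropped, the result is returned instead of printed, years are stored as integers, and the rows are assumed to look like word,year,... with no header line or blank lines. The sample data is made up so the snippet runs on its own.

```python
import csv
import io


def years_for_word(lines, word, start_year, end_year):
    """Collect the years in [start_year, end_year) in which `word` occurs.

    `lines` is any iterable of CSV text lines, such as an open file.
    Rows are assumed to look like: word,year,...
    """
    start, end = int(start_year), int(end_year)  # convert once, not per row
    years = set()
    for row in csv.reader(lines, delimiter=','):
        if row[0] != word:
            continue  # cheap string test first, skip non-matching words
        year = int(row[1])
        if start <= year < end:  # end is excluded, as with range()
            years.add(year)
    return years


# Made-up sample data standing in for all_words.csv
sample = io.StringIO("apple,1965,10\napple,1980,4\npear,1970,7\napple,1959,2\n")
result = years_for_word(sample, "apple", "1960", "1980")
print(result)  # {1965}
```

With a real file you would pass the open file object instead of the StringIO, for example inside with open('all_words.csv', newline='') as allWords:.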

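Edit: I should point out that it's only in Python 2 that an expression like 500000 in range(1, 1000000) is inefficient, because range there builds the whole list and the in test scans it. In Python 3, range is a lazy sequence and the membership test for integers takes constant time. In Python 2, xrange avoids building the list, but as far as I know its in test still iterates over the values, so the direct comparison remains the safer choice there.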
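
To illustrate the Python 3 behaviour, here is a small sketch with a range far too large to ever be materialized or scanned. The variable names are made up, and it is meant for Python 3 only:

```python
# Python 3 only: range() is a lazy sequence, and for integers the
# `in` test is computed arithmetically rather than by scanning.
big = range(1, 10**18)

inside = 5 * 10**17 in big  # no 10**18-element list is ever built
at_stop = 10**18 in big     # the stop value itself is excluded

print(inside)   # True
print(at_stop)  # False
```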