I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (the files range from 100 MB to 1.6 GB). I can do this (very slowly) for files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:
    import csv

    def getdata(filename, criteria):
        data = []
        for criterion in criteria:
            data.append(getstuff(filename, criterion))
        return data

    def getstuff(filename, criterion):
        data = []
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            for row in datareader:
                if row == "column header":
                    data.append(row)
                elif len(data) < 2 and row != criterion:
                    pass  # haven't reached the matching rows yet
                elif row == criterion:
                    data.append(row)
                else:
                    return data  # past the matching block, stop reading
You are reading all rows into a list, then processing that list. Don't do that.
Process your rows as you produce them. If you need to filter the data first, use a generator function:
    import csv

    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            count = 0
            for row in datareader:
                if row in ("column header", criterion):
                    yield row
                    count += 1
                elif count < 2:
                    continue
                else:
                    return
I also simplified your filter test; the logic is the same but more concise.
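If you want to convince yourself the two spellings of the test behave the same, here is a tiny check (the values are made up for illustration):

    # Made-up values, purely to show the two filter tests are equivalent.
    criterion = "some value"
    for row in ("column header", "some value", "something else"):
        long_form = (row == "column header" or row == criterion)
        short_form = row in ("column header", criterion)
        assert long_form == short_form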
You can now loop over getstuff() directly. Do the same in getdata():
    def getdata(filename, criteria):
        for criterion in criteria:
            for row in getstuff(filename, criterion):
                yield row
Now loop directly over getdata() in your code:
    for row in getdata(somefilename, sequence_of_criteria):
        # process row
You now only hold one row in memory, instead of thousands of rows per criterion.
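For instance, you can stream the filtered rows straight into an output file without ever building a list; the file names and criteria below are hypothetical stand-ins for your own:

    import csv

    # Hypothetical names; getdata() is the generator defined above.
    # Each row is written and then discarded, so memory use stays flat
    # no matter how large the input file is.
    with open("filtered.csv", "wb") as outfile:
        writer = csv.writer(outfile)
        for row in getdata("big_input.csv", ["criterion 1", "criterion 2"]):
            writer.writerow(row)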
yield makes a function a generator function, which means it won't do any work until you start looping over it.
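You can see that laziness with a trivial generator:

    def gen():
        print "started"  # only runs once iteration begins
        yield 1
        yield 2

    g = gen()      # prints nothing; this just creates the generator object
    print next(g)  # prints "started", then 1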