David Frank - 11 months ago
Python Question

Pandas read_csv() 1.2GB file out of memory on VM with 140GB RAM

I am trying to read a 1.2 GB CSV file that contains 25K records, each consisting of an id and a large string.

However, at around 10K rows, I get this error:

pandas.io.common.CParserError: Error tokenizing data. C error: out of memory

This seems odd, since the VM has 140GB of RAM, and at 10K rows the memory usage is only at ~1%.

This is the command I use:

pd.read_csv('file.csv', header=None, names=['id', 'text', 'code'])
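As an aside, a hedged workaround sometimes suggested for `CParserError` memory failures is to bypass the C tokenizer entirely with `engine='python'` (slower, but it parses differently). The snippet below is a minimal sketch; the in-memory CSV stands in for the real `file.csv`, whose contents I obviously don't have:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for file.csv (hypothetical data, same column layout).
csv_data = io.StringIO("1,hello,A\n2,world,B\n")

# engine='python' avoids the C parser that raised the tokenizing error;
# it is slower but may behave differently on problematic input.
df = pd.read_csv(csv_data, header=None, names=['id', 'text', 'code'],
                 engine='python')
print(len(df))
```

Whether this helps depends on what is actually tripping the C parser (e.g. unescaped quotes in the large strings), so treat it as a diagnostic step rather than a fix.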

I also ran the following dummy program, which successfully filled my memory to close to 100%:

data = ["hello"]
while True:
    data.append("hello" + data[-1])

Answer Source

This sounds like a job for chunksize. It splits the input into chunks, so the parser only needs to hold one chunk in memory at a time while reading.

tp = pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000)
df = pd.concat(tp, ignore_index=True)
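Note that `pd.concat` still materializes the full DataFrame at the end, so if even the assembled frame is too large, you can process each chunk as it is read and never hold the whole file in memory. A minimal sketch, using a small in-memory CSV in place of the real file:

```python
import io

import pandas as pd

# Small in-memory stand-in for the real CSV file (hypothetical data).
csv_data = io.StringIO("\n".join(f"{i},text{i},C{i}" for i in range(10)))

total_rows = 0
# chunksize=4 makes read_csv yield DataFrames of up to 4 rows each;
# only one chunk is alive in memory per loop iteration.
for chunk in pd.read_csv(csv_data, header=None,
                         names=['id', 'text', 'code'], chunksize=4):
    total_rows += len(chunk)  # replace with whatever per-chunk work you need

print(total_rows)
```

For the question's 1.2 GB file, you would pass the filename instead of `csv_data` and pick a chunksize in the hundreds or thousands of rows.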