I am exploring switching to Python and pandas as a long-time SAS user. However, when running some tests today, I was surprised that Python ran out of memory when trying to read a large CSV file with read_csv.
In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague, but it's been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file, via np.memmap), but it's something I'll be working on in the near future.
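If you want to try that tedious workaround, here's a rough sketch of what I mean. It assumes a purely numeric CSV with a header row and a known shape; the file names, dtype, and dimensions are placeholders for illustration:

    import csv
    import numpy as np

    # Placeholder assumptions: a numeric CSV with a header row and a known
    # shape (count the rows in a cheap first pass if you don't know it).
    n_rows, n_cols = 1_000_000, 10

    # Pre-allocate a memory-mapped file on disk (np.memmap); swap in
    # np.empty((n_rows, n_cols)) if the final array fits comfortably in RAM.
    arr = np.memmap('big_file.dat', dtype='float64', mode='w+',
                    shape=(n_rows, n_cols))

    with open('big_file.csv', newline='') as f:
        reader = csv.reader(f)
        next(reader)                          # skip the header row
        for i, row in enumerate(reader):
            arr[i] = [float(x) for x in row]  # transcribe one row at a time

    arr.flush()                               # push the memmap to disk

This way only one text row lives in memory at a time, and the data lands directly in its final, compact numeric form.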
Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
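A minimal sketch of that chunked approach (the file name and chunk size are placeholders):

    import pandas as pd

    # Read the file in 1000-row pieces instead of one big slurp.
    reader = pd.read_csv('big_file.csv', iterator=True, chunksize=1000)

    # Stitch the pieces back into a single DataFrame.
    chunks = [chunk for chunk in reader]
    df = pd.concat(chunks, ignore_index=True)

If even the concatenated DataFrame is too large, you can instead process each chunk as it's read (filter it, aggregate it) and keep only the reduced results.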