Noobie Noobie - 2 months ago
Python Question

what is the optimal chunksize in pandas read_csv to maximize speed?

I am working with a 20 GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv() with a chunksize parameter of 10,000.

However, this value is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up loading the data.

Any ideas?


chunksize only sets the number of rows per chunk, not its memory footprint. To reason about memory, you'd have to convert that to a memory size per chunk or per row by looking at your dtypes; use df.info(memory_usage='deep'), or here's my idiom:

print('df memory usage per row, by column...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
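
Putting that idiom to work, one rough sketch is to read a small sample, measure bytes per row, and size chunks against a memory budget (the in-memory CSV sample and the 200 MB budget below are illustrative assumptions, not values from the question):

```python
import io
import pandas as pd

# Illustrative stand-in for the real 20 GB file: a small in-memory CSV.
csv_data = "a,b,c\n" + "\n".join(f"{i},{i * 2},name_{i}" for i in range(1000))

# Read a small sample to measure the per-row memory cost of your columns.
sample = pd.read_csv(io.StringIO(csv_data), nrows=500)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# Pick chunksize so each chunk fits a memory budget, e.g. ~200 MB per chunk.
target_chunk_bytes = 200 * 1024**2
chunksize = int(target_chunk_bytes // bytes_per_row)
print(f"~{bytes_per_row:.0f} bytes/row -> chunksize {chunksize}")
```

The same two lines (sample, then divide budget by bytes per row) apply to the real file; just point read_csv at the path instead of the StringIO buffer.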

Or else use your OS (top/Task Manager/Activity Monitor) to see how much memory is being used, and check you're not using all your free memory.

(And use all the standard pandas tricks, like specifying dtypes for each column, and casting low-cardinality string columns to pd.Categorical if you want to reduce them from ~48 bytes per value to 1 or 4.)
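
To see the Categorical saving concretely, here is a small sketch (the colour values are made up for illustration):

```python
import pandas as pd

# An object (string) column costs dozens of bytes per value;
# a Categorical stores each value as a small integer code plus
# one shared copy of the distinct categories.
s = pd.Series(["red", "green", "blue"] * 10_000)
cat = s.astype("category")

obj_bytes = s.memory_usage(index=False, deep=True) / len(s)
cat_bytes = cat.memory_usage(index=False, deep=True) / len(cat)
print(f"object: {obj_bytes:.1f} B/value, category: {cat_bytes:.1f} B/value")
```

With read_csv you can get this directly by passing dtype={'col': 'category'}, so the strings are never materialised as full objects in the first place.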