I am loading a couple of columns from a 20 GB (compressed) .csv file using pandas pd.read_csv() with a chunksize=10000 parameter.
However, this value is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up loading the data.
chunksize only tells you the number of rows per chunk. To reason about memory, you have to convert that into a memory size per chunk or per row by looking at your dtypes; df.info(memory_usage='deep') gives a quick summary, or here's my idiom for bytes-per-row by column:

```python
print('df memory usage by column (bytes per row)...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
```
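Putting that idea together, one way to pick a chunksize is to read a small sample first, measure its bytes per row, and divide a memory budget by that. A minimal sketch (the sample size, budget, and the in-memory CSV here are illustrative, not part of your pipeline):

```python
import io
import pandas as pd

# Stand-in for your 20 GB file: a small in-memory CSV (hypothetical data)
csv_data = io.StringIO(
    "a,b,c\n" + "\n".join(f"{i},{i * 2},x{i}" for i in range(1000))
)

# Read a sample of rows and measure actual deep memory per row
sample = pd.read_csv(csv_data, nrows=1000)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# Derive a chunksize from a memory budget, e.g. ~256 MB per chunk
target_chunk_bytes = 256 * 1024 * 1024
chunksize = max(1, int(target_chunk_bytes / bytes_per_row))
print(f"{bytes_per_row:.1f} bytes/row -> chunksize={chunksize}")
```

You'd then pass the computed value back to pd.read_csv(..., chunksize=chunksize). Note the sample's per-row size is only an estimate; string columns can vary a lot across a big file, so leave headroom in the budget.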
Or else use your OS tools (top / Task Manager / Activity Monitor) to see how much memory is being used, and check that you're not using all your free memory.
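If you want the same check programmatically rather than watching top, the standard-library resource module can report the process's peak resident memory on Unix (a sketch; note the units are platform-dependent, and this is not available on Windows):

```python
import resource

# Peak resident set size of this process so far.
# Units differ by platform: KiB on Linux, bytes on macOS.
peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS so far: {peak_rss}")
```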
(And use all the standard pandas tricks, like specifying dtypes for each column, and using converters or pd.Categorical for string columns if you want to reduce them from ~48 bytes per entry to 1 or 4.)
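To illustrate those last two tricks, here is a hedged sketch comparing a default object-dtype string column against the same column loaded as category via the dtype parameter (the column names and data are made up):

```python
import io
import pandas as pd

# Hypothetical CSV with a low-cardinality string column
csv_data = io.StringIO(
    "id,state\n"
    + "\n".join(f"{i},{['CA', 'NY', 'TX'][i % 3]}" for i in range(100))
)

# Default load: 'state' becomes object dtype, one Python string per row
df_plain = pd.read_csv(csv_data)
obj_bytes = df_plain["state"].memory_usage(index=False, deep=True)

# Explicit dtypes: int32 halves 'id'; category stores each distinct
# string once plus a small integer code per row
csv_data.seek(0)
df_typed = pd.read_csv(csv_data, dtype={"id": "int32", "state": "category"})
cat_bytes = df_typed["state"].memory_usage(index=False, deep=True)

print(f"object: {obj_bytes} bytes, category: {cat_bytes} bytes")
```

The savings grow with row count: the per-row cost of a category column is just the integer code, while an object column pays for a full Python string object per row.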