HFBrowning HFBrowning - 4 months ago 20
Python Question

How to stream in and manipulate a large data file in python

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1 1 M 1 12
County1 2 M 1 3
County1 2 M 2 0


Geography Count
County1 15
County2 23

This would be a simple matter if the whole file could fit in memory but using
. So I have been looking into other methods, and there appears to be many options - HDF5? Using
(which seems complicated - generators?) Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write out before loading in another 70 lines.

Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, especially because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.

Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.


You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

import dask.dataframe as dd

df = dd.read_csv('my_file.csv')
df = df.groupby('Geography')['Count'].sum().to_frame()

Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()

# Combine the chunked data.
df = pd.concat(data, ignore_index=True)
df = df.groupby('Geography')['Count'].sum().to_frame()