David David - 2 months ago
Python Question

Why does Pandas skip first set of chunks when iterating over csv in my code

I have a very large CSV file that I read by iterating over it with pandas' chunksize parameter. The problem: if e.g. chunksize=2, it skips the first 2 rows, and the first chunk I receive contains rows 3 and 4.

Basically, if I read the CSV with nrows=4, I get the first 4 rows, while chunking the same file with chunksize=2 gives me rows 3 and 4 first, then 5 and 6, ...

#1. Read with nrows
#read first 4 rows of the csv file and merge the date and time columns to be used as index
import pandas as pd

reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime": [1, 2]}, index_col=[0], nrows=4)

print (reader)

01/01/2016 - 09:30 - A - 100
01/01/2016 - 13:30 - A - 110
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

#2. Iterate over csv file in chunks
#iterate over the csv file in chunks and merge the date and time columns to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime": [1, 2]}, index_col=[0], chunksize=2)

for chunk in reader:
    #create a dataframe from chunks
    df = reader.get_chunk()
    print(df)

01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115


Increasing chunksize to 10 skips the first 10 rows.

Any ideas how I can fix this? I already have a workaround that works, but I'd like to understand where I went wrong.

Any input is appreciated!

Answer

Don't call get_chunk. Iterating over the reader already yields each chunk in turn: the for loop reads rows 1-2 into chunk, and the get_chunk() call inside the loop then reads the next chunk (rows 3-4), which is what you print. So nothing is skipped; each first chunk is silently consumed by the loop and discarded. Call print(chunk) in your loop, and you should see the expected output.
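A minimal sketch of the corrected loop, using an in-memory CSV (a hypothetical stand-in for the file in the question, without the date columns) to show that plain iteration starts from the very first row:

```python
import io

import pandas as pd

# Hypothetical stand-in for filename.csv: six data rows, no header.
csv_text = "a,1\nb,2\nc,3\nd,4\ne,5\nf,6\n"

reader = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2)
chunks = []
for chunk in reader:
    # chunk is already a DataFrame; calling reader.get_chunk() here
    # would silently discard it and read the NEXT chunk instead
    chunks.append(chunk)
    print(chunk)
```

The first printed chunk contains rows 1-2 ("a" and "b"), the second rows 3-4, and so on.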

As @MaxU points out in the comments, you want to use get_chunk if you want differently sized chunks: reader.get_chunk(500), reader.get_chunk(100), etc.
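A small sketch of that pattern, again on a hypothetical in-memory CSV: with iterator=True (and no chunksize), you drive the reader manually and each get_chunk(n) call returns the next n rows:

```python
import io

import pandas as pd

# Hypothetical ten-row CSV, no header.
csv_text = "".join(f"row{i},{i}\n" for i in range(10))

# iterator=True returns a TextFileReader without a fixed chunk size
reader = pd.read_csv(io.StringIO(csv_text), header=None, iterator=True)

first = reader.get_chunk(3)   # rows 0-2
second = reader.get_chunk(5)  # rows 3-7, picking up where the last call stopped
print(len(first), len(second))
```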
