Mohammad ElNesr Mohammad ElNesr - 4 years ago 136
Python Question

Reading large CSV files from nth line in Python (not from the beginning)

I have 3 huge CSV files containing climate data, each about 5GB.
The first cell in each line is the meteorological station's number (from 0 to about 100,000) each station contains from 1 to 800 lines in each file, which is not necessarily equal in all files. For example, Station 11 has 600, 500, and 200 lines in file1, file2, and file3 respectively.
I want to read all the lines of each station, do some operations on them, then write results to another file, then the next station, etc.
The files are too large to load at once in memory, so I tried some solutions to read them with minimal memory load, like this post and this post which include this method:

with open(...) as f:
for line in f:
<do something with line>


The problem with this method that it reads the file from the beginning every time, while I want to read files as follows:

for station in range (100798):
with open (file1) as f1, open (file2) as f2, open (file3) as f3:
for line in f1:
st = line.split(",")[0]
if st == station:
<store this line for some analysis>
else:
break # break the for loop and go to read the next file
for line in f2:
...
<similar code to f1>
...
for line in f3:
...
<similar code to f1>
...
<do the analysis to station, the go to next station>


The problem is that each time I start over to take next station, the for loop would start from the beginning, while I want it to start from where the 'Break' occurs at the nth line, i.e. to continue reading the file.

How can I do it?

Thanks in advance

Answer Source

I would suggest to use pandas.read_csv. You can specify the rows to skip using skiprows and also use a reasonable number of rows to load depending on your filesize using nrows Here is a link to the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download