Andy Andy - 1 year ago 98
Python Question

How can I read only the header column of a CSV file using Python?

I am looking for a a way to read just the header row of a large number of large CSV files.

Using Pandas, I have this method available, for each csv file:

>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns

I could do this with just the csv module:

>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames

The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.

My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.

How can I extract only the header row of a CSV file, quickly?

Answer Source

I've used iglob as an example to search for the .csv files, but one way is to use a set, then adjust as necessary, eg:

import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, 'rb') as fin:
        csvin = csv.reader(fin)
        unique_headers.update(next(csvin, []))