King Java King Java - 2 months ago 15
Python Question

Pandas: how to designate starting row to extract data

I am using Pandas library and Python.

I have an Excel file that has some heading information on the top of an Excel sheet which I do not need for data extraction.

But, the heading information could take longer rows, so it is unpredictable how long it could be.

So, my data extraction should start from where it says "ID"...
For this particular case, it starts from row 5, but it could change.

The image is shown on the bottom (I grayed out after row 5 for sensitive info).

How do I put this in logic (to skip heading and jump to row 5)?
The pattern should be, row heading starts from "ID, EMP_ID" etc.

enter image description here

with open('File.xls') as fp:
skip = next(filter(
lambda x: x.startswith('ID'),
enumerate(fp)
))[0]

df = pd.read_excel('File.xls', usercols=['ID', 'EMP_ID'], skiprows=skip)
print df

Answer Source

You could manually check for the header line and then use read_csvs keyword argument skiprows.

with open('data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith('ID'),
        enumerate(fp)
    ))[0]

Then skip the rows:

df = pandas.read_csv('data.csv', skiprows=skip)

Like that you can support pre-header sections of arbitrary length.


For Python 2:

import itertools as it

with open('data.csv') as fp:
    skip = next(it.ifilter(
        lambda x: x[1].startswith('ID'),
        enumerate(fp)
    ))[0]