keirasan keirasan - 3 months ago 15
Python Question

Parsing a Pandas dataframe with an unknown number of columns for use in statsmodels.api

I would like to create a generic script to perform linear regressions on multiple data sets. Each data set will have the same y-variable called "SM" and an unknown number of x-variables. I have been able to do this successfully if I know exactly which data will be used for the regression. For example:

import pandas
import statsmodels.api as sm
import statsmodels.formula.api as smf
from patsy import dmatrices

data = pandas.read_excel('test.xlsx')


Then, print data gives:

print data
SM Glass mag
SiO2 73.500 77.27 0.00
TiO2 0.233 0.15 7.37
Al2O3 11.230 11.49 0.00
FeO* 4.240 2.85 92.46
MnO 0.082 0.06 0.00
MgO 0.040 0.00 0.00
CaO 0.410 0.22 0.00
Na2O 5.630 4.58 0.00
K2O 4.620 3.38 0.00


Then I prepare the dataframe and do the linear regression:

y, X = dmatrices('SM ~ Glass + mag', data=data, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
print res.summary()


This all works great. BUT, I'd like to be able to import an excel file with an unknown number of columns so I can do:

y, X = dmatrices('SM ~ X1 + X2 + X3 + ... Xn', data=data, return_type='dataframe')


I can parse the data frame and pull out individual columns, but I don't know how to them put them into the formula needed to do the linear regression. Any advice is appreciated!

Answer

See if this works:

df = pd.DataFrame(np.arange(20).reshape(2, 10), columns=list('abcdefghij'))

formula = '{} ~ {}'.format(df.columns[0], ' + '.join(df.columns[1:]))

formula

'a ~ b + c + d + e + f + g + h + i + j'