Kiran Prajapati - 4 years ago
Python Question

How to write dynamic code that fits multiple regression models and automatically calculates and outputs their R-squared values

I have a CSV file like this:

T_yield Rain PH EC OC N P K S Zn Fe
6854.5 85.25 1 0 0 0 2 2 2 1 1
4494.2 75.475 1 0 0 0 2 2 2 2 1
1449 43.56 1 0 0 0 2 2 2 2 1
4024.3 127.9 1 0 0 0 2 2 2 2 2
2659.2 81.89 1 0 0 0 2 2 2 2 2
4807.1 49.02 1 0 0 0 1 2 2 1 1
1845.8 37.35 1 0 0 0 1 2 2 1 0
4599.2 98.26 1 0 0 0 2 2 2 2 2
4811.7 82.36 1 0 0 0 2 2 2 2 1
4796.6 66.52 1 0 0 0 1 2 2 1 1
1806.7 41.75 1 0 0 0 1 2 2 1 1


Here T_yield is the dependent variable, and Rain, PH, EC, OC, ..., Fe are the independent variables.

I want to check which multiple regression model gives the best fit for predicting T_yield.

For example :

1) T_yield = b0 + b1*Rain

2) T_yield = b0 + b1*PH

3) T_yield = b0 + b1*EC

:

:

11) T_yield = b0 + b1*Rain + b2*PH

12) T_yield = b0 + b1*Rain + b2*EC

:
:

T_yield = b0 + b1*Rain + b2*PH + b3*EC + b4*OC + b5*N + b6*P + b7*K ... and so on.
So, how can I write dynamic code for multiple regression that fits every combination of variables and displays the R-squared value for each?

I am using this code :

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv('/home/desktop/regression.csv')
# The column in the file is named T_yield, so the formula must use that name
lm = smf.ols(formula='T_yield ~ Rain', data=data).fit()
lm.summary()


This gives me the R-squared value for Rain.
Next, I add another independent variable:

import statsmodels.formula.api as smf
lm = smf.ols(formula='T_yield ~ Rain + PH', data=data).fit()
lm.summary()


This gives me the R-squared value for Rain + PH.

Then I add another variable:

import statsmodels.formula.api as smf
lm = smf.ols(formula='T_yield ~ Rain + PH + EC', data=data).fit()
lm.summary()


And so on. Is there any way to automatically print the R-squared value for every such model?

Answer

What @PeterE and I were suggesting is that, instead of writing separate code and fitting a separate model for each set of independent variables, you fit a single model with all the independent variables.

import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv('/home/desktop/regression.csv')
# Fit one model with every independent variable; the target column is T_yield
lm = smf.ols(formula='T_yield ~ Rain + PH + EC + OC + N + P + K + S + Zn + Fe', data=data).fit()
lm.summary()

That is, include every independent variable in one model rather than building all the different combinations by hand. OLS does not drop variables on its own, but the coefficients and p-values in the summary show which variables contribute and which do not.

This results in output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                T_yield   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.348
Method:                 Least Squares   F-statistic:                     2.335
Date:                Tue, 28 Feb 2017   Prob (F-statistic):              0.169
Time:                        10:40:16   Log-Likelihood:                -91.598
No. Observations:                  11   AIC:                             193.2
Df Residuals:                       6   BIC:                             195.2
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    237.3697    150.151      1.581      0.165      -130.037   604.776
Rain          46.8455     27.206      1.722      0.136       -19.726   113.417
PH           237.3697    150.151      1.581      0.165      -130.037   604.776
EC          3.591e-13    2.4e-13      1.499      0.185     -2.27e-13  9.45e-13
OC          5.245e-15   2.75e-13      0.019      0.985     -6.67e-13  6.77e-13
N                   0          0        nan        nan             0         0
P           1970.3552   1727.567      1.141      0.298     -2256.848  6197.558
K            474.7395    300.302      1.581      0.165      -260.073  1209.552
S            474.7395    300.302      1.581      0.165      -260.073  1209.552
Zn         -2877.5614   1582.540     -1.818      0.119     -6749.898   994.775
Fe          -575.9214   1194.274     -0.482      0.647     -3498.205  2346.362
==============================================================================
Omnibus:                        2.245   Durbin-Watson:                   2.342
Prob(Omnibus):                  0.325   Jarque-Bera (JB):                0.892
Skew:                           0.010   Prob(JB):                        0.640
Kurtosis:                       1.605   Cond. No.                     1.68e+35
==============================================================================

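As you can see, the coef values for EC, OC and N are 0 or very close to 0. This follows from your data: every value in EC, OC and N is 0, so these columns never vary and cannot explain any of the variation in T_yield. PH, K and S are constant as well (all 1, 2 and 2), so their coefficients cannot be separated from the intercept, which is also why the condition number is so large.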
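If the R-squared of every combination is still wanted, as the question asks, the models can be generated in a loop rather than typed out. The following is only a minimal sketch, not the method proposed in this answer: it inlines the sample rows from the question so that it runs on its own, skips columns that never vary, and fits one OLS model per subset of the remaining columns using itertools.combinations. The helper name all_subset_r2 and the result column names are made up for illustration, and reading the data with a whitespace separator is an assumption about the file layout.

```python
from io import StringIO
from itertools import combinations

import pandas as pd
import statsmodels.formula.api as smf

# Sample rows copied from the question, inlined so the sketch is self-contained.
# With the real file, replace StringIO(RAW) with the path to the CSV
# (and drop sep=... if the file is comma-separated).
RAW = """T_yield Rain PH EC OC N P K S Zn Fe
6854.5 85.25 1 0 0 0 2 2 2 1 1
4494.2 75.475 1 0 0 0 2 2 2 2 1
1449 43.56 1 0 0 0 2 2 2 2 1
4024.3 127.9 1 0 0 0 2 2 2 2 2
2659.2 81.89 1 0 0 0 2 2 2 2 2
4807.1 49.02 1 0 0 0 1 2 2 1 1
1845.8 37.35 1 0 0 0 1 2 2 1 0
4599.2 98.26 1 0 0 0 2 2 2 2 2
4811.7 82.36 1 0 0 0 2 2 2 2 1
4796.6 66.52 1 0 0 0 1 2 2 1 1
1806.7 41.75 1 0 0 0 1 2 2 1 1
"""

data = pd.read_csv(StringIO(RAW), sep=r"\s+")


def all_subset_r2(df, target):
    """Fit OLS for every non-empty subset of the varying predictors (hypothetical helper)."""
    # Columns with a single distinct value cannot explain any variation, so skip them.
    predictors = [c for c in df.columns if c != target and df[c].nunique() > 1]
    rows = []
    for k in range(1, len(predictors) + 1):
        for combo in combinations(predictors, k):
            formula = f"{target} ~ " + " + ".join(combo)
            fit = smf.ols(formula=formula, data=df).fit()
            rows.append(
                {"formula": formula, "r2": fit.rsquared, "adj_r2": fit.rsquared_adj}
            )
    # Plain R-squared never decreases as variables are added, so rank by adjusted R-squared.
    return (
        pd.DataFrame(rows)
        .sort_values("adj_r2", ascending=False)
        .reset_index(drop=True)
    )


results = all_subset_r2(data, "T_yield")
print(results.to_string(index=False))
```

On the sample rows only Rain, P, Zn and Fe vary, so 15 models are fitted. The number of models grows as 2^p - 1 for p predictors, so this brute-force approach is only practical for a small number of columns, and with 11 observations the ranking should be read with caution.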
