fbstats - 6 months ago 69

Python Question

My regression model using statsmodels in Python works with 48,065 rows of data, but while adding new data I tracked down a single row that produces a singular matrix error. Answers to similar questions suggest missing data, but I have checked and there is nothing visibly irregular in the row that triggers the error. Does anyone know whether this is an error in my code, or know a way to fix it? I'm out of ideas.

Data2.csv - http://www.sharecsv.com/s/8ff31545056b8864f2ad26ef2fe38a09/Data2.csv

`import pandas as pd`

import statsmodels.formula.api as smf

data = pd.read_csv("Data2.csv")

formula = 'is_success ~ goal_angle + goal_distance + np_distance + fp_distance + is_fast_attack + is_header + prev_tb + is_rebound + is_penalty + prev_cross + is_tb2 + is_own_goal + is_cutback + asst_dist'

model = smf.mnlogit(formula, data=data, missing='drop').fit()

CSV Line producing error:

`0,0,0,0,0,0,0,1,22.94476,16.877204,13.484806,20.924627,0,0,11.765203`

Error when the problematic line is included in the data:

`runfile('C:/Users/User1/Desktop/Model Check.py', wdir='C:/Users/User1/Desktop')`

Optimization terminated successfully.

Current function value: 0.264334

Iterations 20

Traceback (most recent call last):

File "<ipython-input-76-eace3b458e24>", line 1, in <module>

runfile('C:/Users/User1/Desktop/xG_xA Model Check.py', wdir='C:/Users/User1/Desktop')

File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile

execfile(filename, namespace)

File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile

exec(compile(scripttext, filename, 'exec'), glob, loc)

File "C:/Users/User1/Desktop/xG_xA Model Check.py", line 6, in <module>

model = smf.mnlogit(formula, data=data, missing='drop').fit()

File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\discrete\discrete_model.py", line 587, in fit

disp=disp, callback=callback, **kwargs)

File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 434, in fit

Hinv = np.linalg.inv(-retvals['Hessian']) / nobs

File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 526, in inv

ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)

File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular

raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix

Answer

As far as I can see, the problem is the variable `is_own_goal`: all observations where it is 1 also have the dependent variable `is_success` equal to 1. That means there is no variation in the outcome within that group, because `is_own_goal` already specifies that it is a success.
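A quick way to check for this kind of separation is to cross-tabulate each dummy against the outcome and look for empty cells. The sketch below uses a small made-up frame in place of Data2.csv, and `separating_dummies` is a hypothetical helper written for illustration, not part of pandas or statsmodels.

```python
import pandas as pd

# Made-up stand-in for Data2.csv; only the column names come from the question.
toy = pd.DataFrame({
    "is_own_goal": [0, 0, 0, 0, 1, 1, 0, 0],
    "is_header":   [0, 1, 0, 1, 0, 1, 0, 1],
    "is_success":  [0, 1, 0, 0, 1, 1, 0, 1],
})

def separating_dummies(df, outcome, dummies):
    """Hypothetical helper: list dummies with an empty dummy-by-outcome cell."""
    flagged = []
    for col in dummies:
        table = pd.crosstab(df[col], df[outcome])
        # Make sure both outcome values appear as columns, even if one is absent.
        table = table.reindex(columns=[0, 1], fill_value=0)
        if (table == 0).any().any():
            flagged.append(col)
    return flagged

counts = pd.crosstab(toy["is_own_goal"], toy["is_success"])
flagged = separating_dummies(toy, "is_success", ["is_own_goal", "is_header"])
print(counts)
print(flagged)
```

On the real data you would pass `data` and the list of 0/1 regressors from the formula. Any flagged column is a candidate for dropping from the model or for handling separately.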

As a consequence, we cannot estimate a coefficient for `is_own_goal`; the coefficient is not identified by the data. The variance of the coefficient would be infinite, and inverting the Hessian to get the covariance of the parameter estimates fails because the Hessian is singular. Given floating point precision, with some computational noise the Hessian might still be invertible and the Singular Matrix exception would not show up. I guess that is the reason it works with some but not all observations.
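To make the singular Hessian concrete, here is a rough numerical sketch with made-up numbers. It uses the standard logit information matrix X'WX with W = diag(p(1-p)), which is the negative Hessian that gets inverted. `logit_information` is a hypothetical helper, and the design matrix is a toy: an intercept plus one dummy that is 1 only in rows where the outcome is always a success. As the optimizer pushes the dummy's coefficient upward, p goes to 1 in those rows, their weights go to 0, and the determinant collapses.

```python
import numpy as np

# Toy design: intercept column plus a dummy that is 1 in the last two rows.
X = np.array([[1.0, 0.0]] * 6 + [[1.0, 1.0]] * 2)

def logit_information(X, beta):
    """Hypothetical helper: X' W X, the negative Hessian of the logit log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    return X.T @ (w[:, None] * X)

dets = []
for b_dummy in [1.0, 10.0, 40.0]:
    info = logit_information(X, np.array([-0.7, b_dummy]))
    dets.append(np.linalg.det(info))
    print(b_dummy, dets[-1])

# At the largest coefficient the weights for the dummy rows underflow to 0,
# so the matrix is exactly singular and inversion fails.
try:
    np.linalg.inv(logit_information(X, np.array([-0.7, 40.0])))
    inverted = True
except np.linalg.LinAlgError:
    inverted = False
```

Whether the real fit lands on an exactly singular matrix or a merely ill-conditioned one depends on where the optimizer stops, which fits the observation that the error appears only for some subsets of the data.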

BTW: If the dependent variable, endog, is binary, then Logit is more appropriate, even though MNLogit has it as a special case.

BTW: Penalized estimation would be another way to force an estimate even in singular cases, although the coefficient would still not be identified by the data and be just a consequence of the penalization.

In this example, `mod = smf.logit(formula, data=data, missing='drop').fit_regularized()` works for me. This is L1 penalization. In statsmodels 0.8, there is also elastic net penalization for GLM, which has Binomial (i.e. Logit) as a family.
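For a self-contained version of the penalized fit, here is a minimal sketch on synthetic data, since Data2.csv is not reproduced here. The data-generating numbers and the choice `alpha=1.0` are assumptions for illustration only. As far as I know, `alpha` defaults to 0 in `fit_regularized`, so the sketch passes a positive value explicitly to make sure a penalty is actually applied.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the shot data; only the column names mirror the question.
toy = pd.DataFrame({
    "goal_distance": rng.uniform(5, 30, n),
    "is_own_goal": (rng.random(n) < 0.05).astype(int),
})
p = 1.0 / (1.0 + np.exp(-(1.0 - 0.15 * toy["goal_distance"])))
toy["is_success"] = (rng.random(n) < p).astype(int)
# Reproduce the problem: every own goal is a success.
toy.loc[toy["is_own_goal"] == 1, "is_success"] = 1

# L1-penalized logit; alpha=1.0 is an arbitrary illustrative penalty weight.
res = smf.logit("is_success ~ goal_distance + is_own_goal", data=toy).fit_regularized(
    method="l1", alpha=1.0, disp=0
)
print(res.params)
```

The coefficient reported for `is_own_goal` here is a product of the penalty weight rather than of the data, so it will move if `alpha` changes.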