user7147790 user7147790 - 17 days ago 8
Python Question

Numbers as variable names not recognized by statsmodels.formula.api

Consider the following example:

import pandas as pd
from pandas import DataFrame
import statsmodels.formula.api as smf
df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})


Then
smf.ols('a ~ b', df)
smf.ols('177sdays ~ b', df2)


The first call works and the second does not. The only difference seems to be that the variable name starts with numerical characters. Why is this?

Answer

Apparently, statsmodels uses a library called patsy to interpret the formulas passed to ols. According to the patsy docs, an expression of the form:

y ~ a + a:b + np.log(x)

will construct a patsy object of the form:

ModelDesc([Term([EvalFactor("y")])],
      [Term([]),
       Term([EvalFactor("a")]),
       Term([EvalFactor("a"), EvalFactor("b")]),
       Term([EvalFactor("np.log(x)")])])

EvalFactor then "executes arbitrary Python code." Thus your variable names must be valid Python identifiers, i.e. made up of the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9. The name 177sdays starts with a digit, so it is not a valid identifier and the formula cannot be evaluated.
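As a rough sketch of a workaround: patsy documents a quoting helper, Q("..."), for referring to columns whose names are not valid identifiers. The snippet below uses only the standard library to decide which names need quoting and to assemble a formula string. The helper names needs_quoting, formula_term and build_formula are made up for this illustration and are not part of statsmodels or patsy.

```python
import keyword


def needs_quoting(name):
    """Return True if `name` cannot be written bare in a formula."""
    # A bare formula term has to parse as a Python identifier,
    # and must not collide with a reserved word such as `class`.
    return not name.isidentifier() or keyword.iskeyword(name)


def formula_term(name):
    """Wrap awkward names in Q('...'); leave valid identifiers untouched."""
    # Assumption: the resulting string is handed to patsy, which
    # provides Q() inside formulas for exactly this purpose.
    return "Q({!r})".format(name) if needs_quoting(name) else name


def build_formula(response, predictors):
    """Assemble 'response ~ p1 + p2 + ...' with quoting where needed."""
    rhs = " + ".join(formula_term(p) for p in predictors)
    return "{} ~ {}".format(formula_term(response), rhs)


print(needs_quoting("a"))                # False
print(needs_quoting("177sdays"))         # True
print(build_formula("177sdays", ["b"]))  # Q('177sdays') ~ b
```

The resulting string should then be usable as smf.ols("Q('177sdays') ~ b", df2). Renaming the column to something that starts with a letter, for example with df2.rename(columns={'177sdays': 'sdays177'}), is another option and keeps the formula readable.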