user7147790 - 5 months ago
Python Question

Numbers as variable names not recognized by statsmodels.formula.api

Consider the following example:

import pandas as pd
from pandas import DataFrame
import statsmodels.formula.api as smf
df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})

smf.ols('a ~ b', df)          # works
smf.ols('177sdays ~ b', df2)  # raises an error

The first call works and the second does not. The only difference seems to be the presence of numerical characters in the variable name. Why is this?


Apparently, statsmodels uses a library called patsy to interpret the formulas passed to ols. According to the patsy docs, an expression of the form:

y ~ a + a:b + np.log(x)

will construct a patsy object containing terms of the form:

       Term([EvalFactor("a"), EvalFactor("b")]),

EvalFactor then "executes arbitrary Python code." Thus your variable names must be valid Python identifiers. That is, they may contain the uppercase and lowercase letters A through Z, the underscore _ and, except as the first character, the digits 0 through 9. A name such as 177sdays starts with a digit, so it cannot be evaluated as a variable name.
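
As a rough illustration of the identifier rule above, here is a minimal sketch that checks column labels with the standard library and proposes formula-safe replacements. The helper names `is_valid_formula_name` and `to_formula_name`, and the `x_` prefix, are made up for this example and are not part of statsmodels or patsy.

```python
import keyword
import re


def is_valid_formula_name(name):
    """Return True if `name` is a Python identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


def to_formula_name(name, prefix="x_"):
    """Hypothetical helper: turn an arbitrary column label into an identifier."""
    # Replace every character outside [A-Za-z0-9_] with an underscore.
    cleaned = re.sub(r"\W", "_", name, flags=re.ASCII)
    # Names that start with a digit, are empty, or are keywords get a prefix.
    if not is_valid_formula_name(cleaned):
        cleaned = prefix + cleaned
    return cleaned


columns = ["a", "b", "177sdays"]

# Map only the offending labels to safe replacements.
renames = {c: to_formula_name(c) for c in columns if not is_valid_formula_name(c)}
print(renames)
```

With pandas, the resulting mapping could be passed to df2.rename(columns=renames), after which a formula like 'x_177sdays ~ b' refers to an ordinary identifier. Alternatively, patsy provides a quoting helper Q() for awkward names, so smf.ols('Q("177sdays") ~ b', df2) should work without renaming the column.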