user7147790 - 9 months ago

Python Question

Consider the following example:

    import pandas as pd
    from pandas import DataFrame
    import statsmodels.formula.api as smf

    df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})
    df2 = DataFrame({'177sdays': [1,2,3], 'b': [2,3,4]})

Then

    smf.ols('a ~ b', df)
    smf.ols('177sdays ~ b', df2)

The first call works and the second does not. The only difference seems to be the presence of numerical characters in the variable name. Why is this?

Answer Source

Apparently, `statsmodels` uses a library called `patsy` to interpret the formulas passed to `ols`. From the docs, an expression of the form:

```
y ~ a + a:b + np.log(x)
```

will construct a `patsy` object of the form:

```
ModelDesc([Term([EvalFactor("y")])],
          [Term([]),
           Term([EvalFactor("a")]),
           Term([EvalFactor("a"), EvalFactor("b")]),
           Term([EvalFactor("np.log(x)")])])
```

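`EvalFactor` then "executes arbitrary Python code." Thus your variable names must be valid Python identifiers, i.e. made up of the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9. `177sdays` starts with a digit, so it is not a valid identifier and the formula cannot be evaluated. Digits elsewhere in the name are fine.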
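As a rough sketch of how you might check column names before building a formula, the snippet below uses only the standard library. The helper names (`is_plain_formula_name`, `formula_term`, `build_formula`) are made up for illustration and are not part of `statsmodels` or `patsy`. It assumes that names which are not valid identifiers can be wrapped in `patsy`'s `Q("...")` quoting helper, and that the column names themselves contain no double quotes.

```python
import keyword


def is_plain_formula_name(name):
    """Return True if `name` is a valid, non-keyword Python identifier."""
    return name.isidentifier() and not keyword.iskeyword(name)


def formula_term(name):
    """Return `name` as-is if it is a plain identifier, else wrap it in Q("...").

    Assumption: `name` contains no double-quote characters.
    """
    if is_plain_formula_name(name):
        return name
    return 'Q("{}")'.format(name)


def build_formula(response, predictors):
    """Build a formula string such as 'y ~ x1 + x2' from column names."""
    rhs = " + ".join(formula_term(p) for p in predictors)
    return "{} ~ {}".format(formula_term(response), rhs)


print(is_plain_formula_name("a"))         # True
print(is_plain_formula_name("177sdays"))  # False: starts with a digit
print(build_formula("a", ["b"]))          # a ~ b
print(build_formula("177sdays", ["b"]))   # Q("177sdays") ~ b
```

With this, something like `smf.ols(build_formula('177sdays', ['b']), df2)` should be accepted. Renaming the column (for example with `df2.rename(columns={'177sdays': 'sdays177'})`) is the other obvious option.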