Mahesh Yadav - 2 months ago
R Question

Column names in a regsubsets fit differ from those of the original data

I am trying to get the important variables (column names) from regsubsets. I would like to extract them one by one so that I can analyze them. Here is the code:

reg_fit=regsubsets(Salary~., data = Hitters, nvmax = 10, method = "forward")

The problem is that the column names in reg_fit are not the same as those of the Hitters data.

Here is the output from the original data:

## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"

Here is the output extracted from reg_fit:

## [1] "(Intercept)" "AtBat" "Hits" "HmRun" "Runs"
## [6] "RBI" "Walks" "Years" "CAtBat" "CHits"
## [11] "CHmRun" "CRuns" "CRBI" "CWalks" "LeagueN"
## [16] "DivisionW" "PutOuts" "Assists" "Errors" "NewLeagueN"

Note that League is changed to LeagueN and Division is changed to DivisionW. Is this a bug, or is there an easy way to get the original column names from reg_fit?


It's not a bug. regsubsets breaks each categorical variable into indicator (dummy) variables so that it can be used in the regression. The name change tells you which level is coded as 1: LeagueN is 1 when League is "N" and 0 otherwise, and DivisionW is 1 when Division is "W".
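To see the renaming in isolation, here is a minimal sketch in base R. The `toy` data frame and its values are made up for illustration and only stand in for Hitters. It assumes the formula is expanded the way `model.matrix()` does it, and uses the `"assign"` attribute to map each expanded column back to the original variable:

```r
# Toy data standing in for Hitters (hypothetical values).
toy <- data.frame(
  Salary   = c(475, 480, 500, 91.5, 750, 70),
  Hits     = c(81, 130, 141, 87, 169, 37),
  League   = factor(c("N", "A", "N", "N", "A", "A")),
  Division = factor(c("W", "W", "E", "E", "W", "E"))
)

# Expanding the formula turns each factor into indicator columns
# named <variable><level>; the first level is the baseline and gets no column.
mm <- model.matrix(Salary ~ ., data = toy)
colnames(mm)

# "assign" records which term each column came from (0 = intercept),
# so it can translate expanded names back to the original column names.
term_labels <- attr(terms(Salary ~ ., data = toy), "term.labels")
origin <- c("(Intercept)", term_labels)[attr(mm, "assign") + 1]
names(origin) <- colnames(mm)
origin
```

Looking up `origin["LeagueN"]` gives `"League"`, so the same lookup could be applied to the names taken from a fitted object, provided those names match the model matrix columns.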

If you want to avoid this you can do so with pre-processing. Here's an example for the variable League:

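library(ISLR)   # assumed source of the Hitters data
library(leaps)  # provides regsubsets()

# Replace the factor with a 0/1 numeric indicator: 1 when League is "N"
Hitters$League <- as.numeric(Hitters$League == "N")

reg_fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 10, method = "forward")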

In the example above I created a numeric variable which equals 1 when League is equal to N and used that to replace the factor variable version of League.

For binary factor variables you could instead just change the labels in the resulting object after running the regression. That won't work if a factor has more than two levels. For multi-level factor variables you'll need to create multiple indicator variables in the original dataset, one per non-baseline level, in the same way as in the example above.
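For the multi-level case, a minimal sketch of the pre-processing might look like this. The `Team` factor, its levels, and the `Team_` prefix are hypothetical and not part of Hitters; the loop creates one 0/1 column per level except the first, which is treated as the baseline:

```r
# Hypothetical data with a three-level factor (not a Hitters column).
toy <- data.frame(
  y    = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.9),
  Team = factor(c("ATL", "BOS", "CHC", "ATL", "CHC", "BOS"))
)

# One numeric indicator per non-baseline level, with a name of your choosing.
for (lv in levels(toy$Team)[-1]) {
  toy[[paste0("Team_", lv)]] <- as.numeric(toy$Team == lv)
}

# Drop the original factor so only the numeric indicators remain.
toy$Team <- NULL
names(toy)
```

Because the indicators are plain numeric columns, a model fitted on this data frame would report them under exactly the names chosen here.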