Mahesh Yadav - 1 year ago 73

R Question

I am trying to get the important variables (column names) from `regsubsets`. I would like to extract them one by one so that I can analyze them. Here is the program:

```
library(leaps)
library(ISLR)
data(Hitters)
reg_fit=regsubsets(Salary~., data = Hitters, nvmax = 10, method = "forward")
```

The problem is that the column names in `reg_fit` are not the same as those in the `Hitters` data.

Here is the output from the original data:

```
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
```

Here is the output extracted from `reg_fit`:

```
colnames(summary(reg_fit)$which)
## [1] "(Intercept)" "AtBat" "Hits" "HmRun" "Runs"
## [6] "RBI" "Walks" "Years" "CAtBat" "CHits"
## [11] "CHmRun" "CRuns" "CRBI" "CWalks" "LeagueN"
## [16] "DivisionW" "PutOuts" "Assists" "Errors" "NewLeagueN"
```

Note that `League` is changed to `LeagueN`, `Division` to `DivisionW`, and `NewLeague` to `NewLeagueN`. Is this a bug, or is there an easy way to get the original column names from `reg_fit`?

Answer Source

It's not a bug. `regsubsets` breaks each categorical variable (factor) into indicator variables so that it can be used in the regression. The name change tells you which level the indicator codes as 1: `LeagueN` is 1 when `League` is `N`, and `DivisionW` is 1 when `Division` is `W`.
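The naming can be reproduced in base R with `model.matrix`, which, as far as I know, is also what `regsubsets` relies on for its formula interface. This is a minimal sketch on a hypothetical toy data frame (`df`, `y`, `x`, and `league` are made up for illustration, not taken from `Hitters`). It also shows one way to map each expanded column back to the variable it came from, using the `assign` attribute of the design matrix.

```r
# Hypothetical toy data (not Hitters): one numeric predictor, one two-level factor
df <- data.frame(
  y      = c(1.0, 2.5, 3.1, 4.8),
  x      = c(10, 20, 30, 40),
  league = factor(c("A", "N", "A", "N"))
)

# Build the design matrix the same way a formula-based fit would
mm <- model.matrix(y ~ ., data = df)
colnames(mm)
# "(Intercept)" "x" "leagueN"   (level "A" is the baseline, "N" gets the indicator)

# Map each design-matrix column back to its originating variable.
# attr(mm, "assign") holds 0 for the intercept and the term index otherwise.
term_labels <- attr(terms(y ~ ., data = df), "term.labels")
origin <- c("(Intercept)", term_labels)[attr(mm, "assign") + 1]
setNames(origin, colnames(mm))
```

The last line returns a named vector in which `leagueN` maps to `league`, so a selected dummy column can be traced back to its original column name without any pre-processing.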

If you want to avoid this, you can do so with pre-processing. Here's an example for the variable `League`:

```
library(leaps)
library(ISLR)
data(Hitters)
# Recode League as a 0/1 numeric indicator (1 = "N") so no dummy column is generated
Hitters$League <- as.numeric(Hitters$League == "N")
reg_fit <- regsubsets(Salary ~ ., data = Hitters, nvmax = 10, method = "forward")
colnames(summary(reg_fit)$which)
```

In the example above, I created a numeric variable that equals 1 when `League` is equal to `N` and used it to replace the `factor` version of `League`.

For binary factor variables, you could instead just change the labels in the resulting object after running the regression. That won't work if a factor has more than two levels, because it expands into several columns. For multi-class factor variables, you'll need to create multiple indicator variables in the original dataset, one per non-baseline level, in the same way as in the example above.
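For a factor with more than two levels, the pre-processing might look like the following sketch. The data frame and the `team_` naming prefix are hypothetical and chosen only for illustration; the first level is treated as the baseline, which matches R's default treatment contrasts.

```r
# Hypothetical toy data with a three-level factor
df <- data.frame(
  y    = c(1, 2, 3, 4, 5, 6),
  team = factor(c("A", "B", "C", "A", "B", "C"))
)

# Create one 0/1 indicator per non-baseline level ("A" is the baseline)
for (lvl in levels(df$team)[-1]) {
  df[[paste0("team_", lvl)]] <- as.numeric(df$team == lvl)
}

# Drop the original factor so the model only sees the numeric indicators
df$team <- NULL
names(df)
# "y" "team_B" "team_C"
```

Because the indicators are ordinary numeric columns with names you chose, the column names reported after fitting should match the names in the data frame.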