Vikash B - 3 months ago
R Question

R xgboost importance plot with many features

I am trying out the Kaggle housing prices challenge: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Here is the script I wrote:

library(Matrix)   # provides sparse.model.matrix
library(xgboost)

train <- read.csv("train.csv")
train$Id <- NULL
# Keep rows with missing values while building the model matrix
previous_na_action <- options('na.action')
options(na.action = 'na.pass')
sparse_matrix <- sparse.model.matrix(SalePrice~.-1,data = train)
# Restore the previous na.action setting (options() accepts the saved list)
options(previous_na_action)
model <- xgboost(data = sparse_matrix, label = train$SalePrice, missing = NA, max.depth = 6, eta = 0.3, nthread = 4, nrounds = 16, verbose = 2, objective = "reg:linear")
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = model)
print(xgb.plot.importance(importance_matrix = importance))


The data has over 70 features. I used xgboost with max.depth = 6 and nrounds = 16.

The importance plot I am getting is very cluttered. How do I view only the top 5 features or so?


Answer

Check out the top_n argument to xgb.plot.importance. It does exactly what you want.

# Plot only top 5 most important variables.
print(xgb.plot.importance(importance_matrix = importance, top_n = 5))

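Edit: top_n is only available in the development version of xgboost. On a release without it, an alternative is to pass only the first five rows of the importance table, which xgb.importance returns already sorted by decreasing importance: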

print(xgb.plot.importance(importance_matrix = importance[1:5]))
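
If you would rather not rely on the table already being sorted, a small helper can pick the top rows explicitly. This is only a sketch: top_features is a hypothetical name, not part of xgboost, and the toy table below contains made-up values purely to show the behaviour. It assumes the importance table has a Gain column, as the one returned by xgb.importance for tree models does.

```r
# Hypothetical helper (not an xgboost function): keep the n rows with the
# largest Gain. Uses only base R indexing, so it works on a plain data.frame
# and on the data.table returned by xgb.importance().
top_features <- function(importance, n = 5) {
  ord <- order(importance$Gain, decreasing = TRUE)
  importance[head(ord, n), ]
}

# Toy importance table with invented values, for illustration only.
toy <- data.frame(
  Feature = c("f1", "f2", "f3", "f4", "f5", "f6", "f7"),
  Gain    = c(0.05, 0.30, 0.10, 0.02, 0.25, 0.08, 0.20),
  stringsAsFactors = FALSE
)

top5 <- top_features(toy, 5)
print(top5$Feature)  # "f2" "f5" "f7" "f3" "f6"

# With the real model from the question, this would be:
# print(xgb.plot.importance(importance_matrix = top_features(importance, 5)))
```

Sorting explicitly costs one extra line but keeps the plot correct even if the table has been reordered or filtered beforehand.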