Ben Ben - 21 days ago 16
R Question

Does XGBoost distinguish between missing values and 0s in a sparse matrix in R?

Sometimes a feature can contain both 0s and missing values. For example, maybe you measure the strikeouts per game per pitcher for a set of baseball pitchers, and you end up with a feature vector like

feats <- c(NA, NA, NA, 3.7, 0, 2.2)


Here, one pitcher averaged 0 strikeouts per game and three pitchers have no data because they haven't pitched a game yet. When we convert this to a sparse matrix we get something like

library(Matrix)
sparse1 <- sparseMatrix(i=4:6, j=rep(1, 3), x=c(3.7, 0, 2.2), dims=c(6, 1))
sparse1

[1,] .
[2,] .
[3,] .
[4,] 3.7
[5,] 0.0
[6,] 2.2


Here, the dgCMatrix class clearly distinguishes missing data from 0s, but from what I understand, missing data in a dgCMatrix is assumed to take on the value 0.
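
To make the distinction concrete, here is a small sketch that looks inside the dgCMatrix built above. It uses only the Matrix package. The variable names (stored_values, stored_rows, dense, n_stored_after_drop) are illustrative and not part of the original question.

```r
library(Matrix)

sparse1 <- sparseMatrix(i = 4:6, j = rep(1, 3), x = c(3.7, 0, 2.2), dims = c(6, 1))

# Values that are physically stored: the explicit 0 sits next to 3.7 and 2.2.
stored_values <- sparse1@x

# Zero-based row indices of the stored entries (rows 4, 5 and 6).
stored_rows <- sparse1@i

# Converting to a dense matrix fills the non-stored entries (rows 1-3) with 0,
# so the stored/non-stored distinction is lost at that point.
dense <- as.matrix(sparse1)

# drop0() removes explicitly stored zeros, leaving only 3.7 and 2.2.
n_stored_after_drop <- length(drop0(sparse1)@x)

print(stored_values)
print(stored_rows)
print(dense)
print(n_stored_after_drop)
```

So the sparse object itself remembers which entries were stored, while any dense conversion treats the non-stored rows as plain 0s.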

What I'm wondering is, when XGBoost attempts to split this data, does it treat the 0s and the missing data separately? In other words, when XGBoost tries to split on this feature, does it follow the NA protocol (checking both split directions) for the missing data, or does it send the missing data to the same location as the non-sparse 0 values?

Answer

To answer the exact question:

Yes. A missing value is not considered in the gain computation (it does not add to the sums of gradients and hessians of the children in a boosting tree), while a 0 value is considered (and does add to those sums).

So a 0 and a missing value are not the same.
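
One way to check this empirically is a toy experiment along the following lines. This is only a sketch: the data are made up for illustration, and it assumes that xgb.DMatrix treats the non-stored entries of a dgCMatrix as missing while keeping explicitly stored zeros as real values. The labels are chosen so that a model can only fit them well if it can tell the explicit-zero rows apart from the non-stored rows.

```r
library(Matrix)
library(xgboost)

# Hypothetical toy data: one feature, 60 rows.
#   rows  1-20: not stored in the sparse matrix, label 1
#   rows 21-40: explicitly stored 0,             label 0
#   rows 41-60: stored value 2,                  label 1
x_sparse <- sparseMatrix(
  i = 21:60, j = rep(1, 40),
  x = c(rep(0, 20), rep(2, 20)),
  dims = c(60, 1)
)
y <- c(rep(1, 20), rep(0, 20), rep(1, 20))

dtrain <- xgb.DMatrix(data = x_sparse, label = y)
bst <- xgb.train(
  params = list(objective = "reg:squarederror", max_depth = 2, eta = 0.5),
  data = dtrain,
  nrounds = 20
)
pred <- predict(bst, dtrain)

# Average prediction for the non-stored rows and for the explicit-zero rows.
pred_missing <- mean(pred[1:20])
pred_zero <- mean(pred[21:40])
print(c(missing = pred_missing, explicit_zero = pred_zero))
```

If the two averages come out clearly different, the model separated the non-stored rows from the explicit 0s. If they come out identical, the two were being treated the same way in your xgboost version.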