Ben - 10 months ago 118

# How to one-hot-encode factor variables with data.table?

For those unfamiliar, one-hot encoding simply refers to converting a column of categories (i.e. a factor) into multiple columns of binary indicator variables where each new column corresponds to one of the classes of the original column. This example will explain it better:

library(data.table)

dt <- data.table(
  ID=1:5,
  Color=factor(c("green", "red", "red", "blue", "green"), levels=c("blue", "green", "red", "purple")),
  Shape=factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

dt
   ID Color    Shape
1:  1 green   square
2:  2   red triangle
3:  3   red   square
4:  4  blue triangle
5:  5 green   cirlce

# one hot encode the colors
color.binarized <- dcast(dt[, list(V1=1, ID, Color)], ID ~ Color, fun.aggregate=sum, value.var="V1", drop=c(TRUE, FALSE))

# Prepend Color_ in front of each one-hot-encoded feature
setnames(color.binarized, setdiff(colnames(color.binarized), "ID"), paste0("Color_", setdiff(colnames(color.binarized), "ID")))

# one hot encode the shapes
shape.binarized <- dcast(dt[, list(V1=1, ID, Shape)], ID ~ Shape, fun.aggregate=sum, value.var="V1", drop=c(TRUE, FALSE))

# Prepend Shape_ in front of each one-hot-encoded feature
setnames(shape.binarized, setdiff(colnames(shape.binarized), "ID"), paste0("Shape_", setdiff(colnames(shape.binarized), "ID")))

# Join one-hot tables with original dataset
dt <- dt[color.binarized, on="ID"]
dt <- dt[shape.binarized, on="ID"]

dt
   ID Color    Shape Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
1:  1 green   square          0           1         0            0            0            1              0
2:  2   red triangle          0           0         1            0            0            0              1
3:  3   red   square          0           0         1            0            0            1              0
4:  4  blue triangle          1           0         0            0            0            0              1
5:  5 green   cirlce          0           1         0            0            1            0              0


This is something I do a lot, and as you can see it's pretty tedious (especially when my data has many factor columns). Is there an easier way to do this with data.table? In particular, I assumed dcast would allow me to one-hot-encode multiple columns at once, but when I try something like

dcast(dt[, list(V1=1, ID, Color, Shape)], ID ~ Color + Shape, fun.aggregate=sum, value.var="V1", drop=c(TRUE, FALSE))


I get one column for every combination of Color and Shape instead of a separate set of indicator columns per factor:

   ID blue_cirlce blue_square blue_triangle green_cirlce green_square green_triangle red_cirlce red_square red_triangle purple_cirlce purple_square purple_triangle
1:  1           0           0             0            0            1              0          0          0            0             0             0               0
2:  2           0           0             0            0            0              0          0          0            1             0             0               0
3:  3           0           0             0            0            0              0          0          1            0             0             0               0
4:  4           0           0             1            0            0              0          0          0            0             0             0               0
5:  5           0           0             0            1            0              0          0          0            0             0             0               0
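
One way to get a separate set of indicator columns per factor while staying in data.table is to reshape to long format first, so that the factor's name and its level become the two parts of each new column name. The sketch below assumes the toy table from the question; `long` and `onehot` are illustrative names. Unlike the per-column version, it does not create a column for the unused level purple, because that level never appears in the long table.

```r
library(data.table)

# Same toy data as in the question (the spelling "cirlce" is kept as-is).
dt <- data.table(
  ID    = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

# Long format: one row per (ID, factor column, observed level).
long <- melt(dt, id.vars = "ID", measure.vars = c("Color", "Shape"),
             variable.name = "variable", value.name = "value")

# Wide format: one column per variable_level pair; length() counts how often
# each pair occurs for an ID, which is 0 or 1 here.
onehot <- dcast(long, ID ~ variable + value, fun.aggregate = length)
```

The result can be joined back onto the original table with `dt[onehot, on="ID"]`, as in the per-column approach.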


Using model.matrix:

> cbind(dt[, .(ID)], model.matrix(~ Color + Shape, dt))
   ID (Intercept) Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
1:  1           1          1        0           0           1             0
2:  2           1          0        1           0           0             1
3:  3           1          0        1           0           1             0
4:  4           1          0        0           0           0             1
5:  5           1          1        0           0           0             0


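This makes the most sense if you're doing modelling: by default the first level of each factor (here blue and cirlce) is dropped as the reference level, so the indicator columns are not collinear with the intercept.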

If you want to suppress the intercept (which also restores the dropped first-level column for the first variable, Color, but not for Shape):

> cbind(dt[, .(ID)], model.matrix(~ Color + Shape - 1, dt))
   ID Colorblue Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
1:  1         0          1        0           0           1             0
2:  2         0          0        1           0           0             1
3:  3         0          0        1           0           1             0
4:  4         1          0        0           0           0             1
5:  5         0          1        0           0           0             0
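
In the output above, Shape still has only two columns (Shapesquare and Shapetriangle), because removing the intercept restores the reference level for the first factor only. If a full set of indicators is wanted for every factor, one possible approach is to pass unreduced contrast matrices through `contrasts.arg`. This is a minimal sketch rather than part of the answer above; it uses a plain data.frame so that only base R is needed, and `df`, `factor_cols`, `full_contrasts` and `mm` are illustrative names.

```r
# Same toy data as in the question, as a plain data.frame.
df <- data.frame(
  ID    = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

factor_cols <- c("Color", "Shape")

# contrasts(x, contrasts = FALSE) returns an identity matrix over the levels
# of x, i.e. one indicator per level instead of (number of levels - 1).
full_contrasts <- lapply(df[factor_cols], contrasts, contrasts = FALSE)

# Build the indicator matrix without an intercept, using the full contrasts.
mm <- model.matrix(~ Color + Shape - 1, data = df,
                   contrasts.arg = full_contrasts)

onehot <- cbind(df["ID"], mm)
```

The resulting matrix is deliberately rank-deficient (each factor's columns sum to 1), which is fine for one-hot features but not something to feed to an unregularised linear model.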