Noobie - 1 year ago 102
R Question

# R: how to use random forests to predict binary outcome using string variables?

Consider the following dataframe

```
outcome <- c(1,0,0,1,1)
string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness')

df = data.frame(outcome,string)

> df
  outcome        string
1       1  I love pasta
2       0   hello world
3       0       1+1 = 2
4       1 pasta madness
5       1 pizza madness
```

Here I would like to use random forests to understand which words in the sentences contained in the `string` variable are strong predictors of the `outcome` variable.

Is there a (simple) way to do that in R?

What you want are the variable importance measures produced by `randomForest`, which you can obtain with the `importance` function. Here is some code that should get you started:

```
outcome <- c(1,0,0,1,1)
string <- c('I love pasta','hello world', '1+1 = 2','pasta madness', 'pizza madness')
```

Step 1: We want `outcome` to be a factor so that `randomForest` will do classification, and `string` to be a character vector.

```
df <- data.frame(outcome=factor(outcome,levels=c(0,1)),string, stringsAsFactors=FALSE)
```
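To see why the factor conversion in Step 1 matters, here is a minimal sketch on a made-up numeric predictor (`x1` and the eight-row response are hypothetical, not part of the question's data). It fits the same response once as a numeric vector and once as a factor, then inspects the `type` component of the fitted object.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, only so that repeated runs match

# Hypothetical toy data: one numeric predictor and a 0/1 response.
x     <- data.frame(x1 = c(1, 2, 3, 4, 5, 6, 7, 8))
y_num <- c(1, 0, 0, 1, 1, 0, 1, 0)

# A numeric response makes randomForest fit a regression forest
# (it also warns that the response has very few unique values).
rf_num <- suppressWarnings(randomForest(x, y_num))

# A factor response makes randomForest fit a classification forest.
rf_fac <- randomForest(x, factor(y_num, levels = c(0, 1)))

rf_num$type  # "regression"
rf_fac$type  # "classification"
```

The per-class importance columns (`0` and `1`) shown later only appear for a classification forest, which is another reason to convert `outcome` up front.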

Step 2: Tokenize the `string` column into words. Here, I'm using `dplyr` and `tidyr` just for convenience. The key is to have just word tokens that you want as your predictor variable.

```
library(dplyr)
library(tidyr)
inp <- df %>% mutate(string=strsplit(string,split=" ")) %>% unnest(string)
##   outcome  string
##1        1       I
##2        1    love
##3        1   pasta
##4        0   hello
##5        0   world
##6        0     1+1
##7        0       =
##8        0       2
##9        1   pasta
##10       1 madness
##11       1   pizza
##12       1 madness
```
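If you would rather not depend on `dplyr` and `tidyr`, the same one-row-per-token layout can be sketched in base R. This is only an illustration under the same assumption as above (words are separated by single spaces); the name `inp_base` is made up here to avoid clashing with `inp`.

```r
# Base-R sketch of the tokenization step (no dplyr/tidyr needed).
outcome <- c(1, 0, 0, 1, 1)
string  <- c('I love pasta', 'hello world', '1+1 = 2', 'pasta madness', 'pizza madness')
df <- data.frame(outcome = factor(outcome, levels = c(0, 1)), string,
                 stringsAsFactors = FALSE)

# Split each sentence on single spaces; the result is a list of character vectors.
tokens <- strsplit(df$string, split = " ", fixed = TRUE)

# Repeat each outcome once per token so rows line up with the flattened tokens.
inp_base <- data.frame(
  outcome = rep(df$outcome, times = lengths(tokens)),
  string  = unlist(tokens),
  stringsAsFactors = FALSE
)

nrow(inp_base)  # 12 rows, one per word token
```

`rep()` keeps `outcome` as a factor, so `inp_base` can be used in place of `inp` in the next step.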

Step 3: Construct a model matrix and feed it to `randomForest`:

```
library(randomForest)
mm <- model.matrix(outcome~string,inp)
rf <- randomForest(mm, inp$outcome, importance=TRUE)
imp <- importance(rf)
##                     0        1 MeanDecreaseAccuracy MeanDecreaseGini
##(Intercept)   0.000000 0.000000             0.000000        0.0000000
##string1+1     0.000000 0.000000             0.000000        0.3802400
##string2       0.000000 0.000000             0.000000        0.4514319
##stringhello   0.000000 0.000000             0.000000        0.4152465
##stringI       0.000000 0.000000             0.000000        0.2947108
##stringlove    0.000000 0.000000             0.000000        0.2944955
##...
```

In the full output, `pasta` and `madness` stand out as the key words for predicting `outcome`.

Please note: `randomForest` has many parameters that become relevant when tackling a real problem at scale. This is by no means a complete solution to your problem; it is only meant to illustrate the use of the `importance` function in answering your question. For details on using `randomForest`, you may want to ask on Cross Validated.
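With more than a handful of words, the importance matrix is easier to read when sorted. The sketch below rebuilds the token table in base R so that it runs on its own, fixes a seed (the value 42 is an arbitrary choice, since forests are random and your numbers will differ from those above), and orders the rows by `MeanDecreaseGini`. The name `imp_sorted` is introduced here for illustration.

```r
library(randomForest)

outcome <- c(1, 0, 0, 1, 1)
string  <- c('I love pasta', 'hello world', '1+1 = 2', 'pasta madness', 'pizza madness')

# One row per word token, built with base R so only randomForest is required.
tokens <- strsplit(string, split = " ", fixed = TRUE)
inp <- data.frame(
  outcome = factor(rep(outcome, times = lengths(tokens)), levels = c(0, 1)),
  string  = unlist(tokens),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed (an assumption) so repeated runs give the same forest
mm  <- model.matrix(outcome ~ string, inp)
rf  <- randomForest(mm, inp$outcome, importance = TRUE)
imp <- importance(rf)

# Sort rows by MeanDecreaseGini, largest first; drop = FALSE keeps the matrix shape.
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
print(head(imp_sorted, 5))
```

For a graphical view of the same measures, `varImpPlot(rf)` from the `randomForest` package draws a dot chart of the variables ordered by importance.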