user7087171 - 1 month ago
R Question

rpart stops at root node and does not split further when there is an obvious information gain

I am trying to use rpart to build a classification tree model.
The test data frame is very simple: two boolean variables in 10 rows.
The hidden logic is simple too: when x is FALSE, y is always FALSE; when x is TRUE, y has a 60% chance of being TRUE.
So I would expect rpart to split once on x to increase node purity. Instead it stays at the root node and does not split at all. Can anyone advise?

> df <- data.frame(x=rep(c(FALSE,TRUE), each=5), y=c(rep(FALSE,7), rep(TRUE,3)))
> df
       x     y
1  FALSE FALSE
2  FALSE FALSE
3  FALSE FALSE
4  FALSE FALSE
5  FALSE FALSE
6   TRUE FALSE
7   TRUE FALSE
8   TRUE  TRUE
9   TRUE  TRUE
10  TRUE  TRUE
> rpart(y~x, method='class', data=df)
n= 10

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 10 3 FALSE (0.7000000 0.3000000) *

Answer

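As I said in my comment, this is meant to avoid overfitting. Formally, the argument minsplit sets the minimum number of observations a node must contain before a split is attempted. It defaults to 20, so with only 10 rows the root is never split. Lowering it gives the result you seek: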

> library(rpart)
> df <- data.frame(x=rep(c(FALSE,TRUE), each=5), y=c(rep(FALSE,7), rep(TRUE,3)))
> rpart(y ~ x, data=df, minsplit=2)
n= 10 

node), split, n, deviance, yval
      * denotes terminal node

1) root 10 2.1 0.3  
  2) x< 0.5 5 0.0 0.0 *
  3) x>=0.5 5 1.2 0.6 *

You can find more arguments for avoiding overfitting (e.g. cp and maxdepth) in

help(rpart.control)
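
The stopping rules can also be bundled into a control object and passed through the control argument. A minimal sketch on the data from the question; the cp and maxdepth values are picked only for illustration and are not a recommendation:

```r
library(rpart)

# Same toy data as in the question
df <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                 y = c(rep(FALSE, 7), rep(TRUE, 3)))

# Default stopping rules: minsplit is 20, so a 10-row root is never split
defaults <- rpart.control()
defaults$minsplit

# Illustrative settings (assumed values): allow small nodes, keep the tree shallow
ctrl <- rpart.control(minsplit = 2, cp = 0.01, maxdepth = 1)
fit <- rpart(y ~ x, data = df, method = "class", control = ctrl)

# Each row of fit$frame is one node of the fitted tree
nrow(fit$frame)
```

On real data, very small minsplit values tend to grow trees that fit noise, so it is usually worth tuning them together with cp rather than simply setting them as low as possible.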

EDIT: With method="class" the output changes to

> rpart(y ~ x, data=df, minsplit=2, method="class")
n= 10 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 10 3 FALSE (0.7000000 0.3000000)  
  2) x< 0.5 5 0 FALSE (1.0000000 0.0000000) *
  3) x>=0.5 5 2 TRUE (0.4000000 0.6000000) *
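
To see why the split on x is worthwhile once it is allowed, the impurity reduction can be checked by hand. The sketch below assumes the Gini index, which is rpart's default splitting criterion for classification. The helper gini() and the variable names are made up for this illustration and are not part of rpart, and the number it produces is a plain hand calculation, not the scaled improvement value rpart reports internally.

```r
# Same toy data as in the question
df <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                 y = c(rep(FALSE, 7), rep(TRUE, 3)))

# Gini impurity of a two-class node where p is the share of TRUE (hypothetical helper)
gini <- function(p) 2 * p * (1 - p)

# Share of TRUE in the root and in the two candidate child nodes
p_root  <- mean(df$y)          # all 10 rows
p_left  <- mean(df$y[!df$x])   # rows with x == FALSE
p_right <- mean(df$y[df$x])    # rows with x == TRUE

# Fraction of the rows that falls into each child
w_left  <- mean(!df$x)
w_right <- mean(df$x)

# Impurity decrease = parent impurity minus weighted child impurity
gain <- gini(p_root) - (w_left * gini(p_left) + w_right * gini(p_right))
gain
```

The root impurity is 2 * 0.7 * 0.3 = 0.42 and the weighted child impurity is 0.5 * 0 + 0.5 * 0.48 = 0.24, so the decrease is about 0.18. The gain is there; the default minsplit simply stops rpart from evaluating it on 10 rows.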