user7087171 - 8 months ago 71

R Question

I am trying to use rpart to build a classification tree model.

The test data frame is very simple, containing only two boolean variables in 10 rows.

The hidden logic is simple too: when x is FALSE, y must be FALSE. When x is TRUE, y has a 60% chance of being TRUE.

So I would expect rpart to split once on x to increase node purity, but it stays at the root node and does not split at all. Can anyone advise?

    > df <- data.frame(x=rep(c(FALSE,TRUE), each=5), y=c(rep(FALSE,7), rep(TRUE,3)))
    > df
           x     y
    1  FALSE FALSE
    2  FALSE FALSE
    3  FALSE FALSE
    4  FALSE FALSE
    5  FALSE FALSE
    6   TRUE FALSE
    7   TRUE FALSE
    8   TRUE  TRUE
    9   TRUE  TRUE
    10  TRUE  TRUE
    > rpart(y~x, method='class', data=df)
    n= 10

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

    1) root 10 3 FALSE (0.7000000 0.3000000) *
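
The expectation that a split on x increases node purity can be checked by hand. The following is only a rough sketch: `gini` is a hypothetical helper written for this illustration, not an rpart function, and it uses the two-class Gini impurity 2 * p * (1 - p), which is the default splitting criterion for `method='class'`. rpart reports its improvement on a different scale, so the numbers here are meant to show the direction of the change, not to reproduce rpart's internals.

```r
# The toy data from the question, rebuilt so the block runs on its own.
df <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                 y = c(rep(FALSE, 7), rep(TRUE, 3)))

# Hypothetical helper: Gini impurity of a logical outcome, 2 * p * (1 - p).
gini <- function(y) {
  p <- mean(y)
  2 * p * (1 - p)
}

root_impurity  <- gini(df$y)         # all 10 rows
left_impurity  <- gini(df$y[!df$x])  # rows with x == FALSE
right_impurity <- gini(df$y[df$x])   # rows with x == TRUE

# Impurity after the split, weighted by the share of rows in each child.
split_impurity <- mean(!df$x) * left_impurity + mean(df$x) * right_impurity

c(root = root_impurity,
  after_split = split_impurity,
  gain = root_impurity - split_impurity)
```

The impurity drops from 0.42 at the root to 0.24 after the split, so the split on x is a genuine improvement. Whatever keeps rpart at the root is therefore a stopping rule and not a lack of signal in the data.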

Answer

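As I said in my comment, this is meant to avoid overfitting. Formally, there is the argument `minsplit`, the minimum number of observations a node must contain before a split is attempted. It is preset to 20, so with only 10 rows the root is never split, but it can be adjusted to give the result you seek: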

```
> library(rpart)
> df <- data.frame(x=rep(c(FALSE,TRUE), each=5), y=c(rep(FALSE,7), rep(TRUE,3)))
> rpart(y ~ x, data=df, minsplit=2)
n= 10

node), split, n, deviance, yval
      * denotes terminal node

1) root 10 2.1 0.3
  2) x< 0.5 5 0.0 0.0 *
  3) x>=0.5 5 1.2 0.6 *
```
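
To see the preset values for yourself, you can inspect what `rpart.control()` returns when called with no arguments. This is a small sketch; the names `ctrl` and `default_fit` are just labels chosen for this example.

```r
library(rpart)

df <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                 y = c(rep(FALSE, 7), rep(TRUE, 3)))

# Default control parameters used when none are supplied.
ctrl <- rpart.control()
ctrl$minsplit    # minimum node size at which a split is attempted
ctrl$minbucket   # minimum size of any terminal node
ctrl$cp          # complexity parameter
ctrl$maxdepth    # maximum depth of the tree

# The root holds all 10 rows, which is below the default minsplit.
nrow(df) < ctrl$minsplit

# With the defaults, the fitted tree has a single node (the root).
default_fit <- rpart(y ~ x, data = df, method = "class")
nrow(default_fit$frame)  # one row per node in the tree
```

Note that `minbucket` defaults to `round(minsplit / 3)`, so lowering `minsplit` in the call also lowers the minimum leaf size unless you set `minbucket` explicitly.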

You can find more arguments for avoiding overfitting (e.g. `cp` and `maxdepth`) in

```
help(rpart.control)
```
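
As an illustration of how `cp` interacts with `minsplit`, here is a sketch that fits the same toy data twice through `rpart.control`. The values 0.01 and 0.5 and the names `loose` and `strict` are arbitrary choices for this example. For a classification tree, the split on x removes one of the three root misclassifications, a relative improvement of about 1/3, so a `cp` above that threshold should keep the tree at the root even though `minsplit` allows the split.

```r
library(rpart)

df <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                 y = c(rep(FALSE, 7), rep(TRUE, 3)))

# Permissive settings: small cp, so the split on x is kept.
loose <- rpart(y ~ x, data = df, method = "class",
               control = rpart.control(minsplit = 2, cp = 0.01, maxdepth = 1))

# Strict settings: cp is larger than the relative improvement of the split,
# so the split is discarded and only the root remains.
strict <- rpart(y ~ x, data = df, method = "class",
                control = rpart.control(minsplit = 2, cp = 0.5, maxdepth = 1))

nrow(loose$frame)   # number of nodes with the small cp
nrow(strict$frame)  # number of nodes with the large cp

# The complexity table lists the cp value attached to each split.
loose$cptable
```

`maxdepth = 1` makes no visible difference here because there is only one predictor to split on, but on wider data it caps the tree at a single level below the root.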

*EDIT: With method="class" the output changes to*

```
> rpart(y ~ x, data=df, minsplit=2, method="class")
n= 10

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 10 3 FALSE (0.7000000 0.3000000)
  2) x< 0.5 5 0 FALSE (1.0000000 0.0000000) *
  3) x>=0.5 5 2 TRUE (0.4000000 0.6000000) *
```
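
If you want to confirm that the fitted tree recovers the logic from the question, you can ask it for predictions on the two possible values of x. This is a minimal sketch; `newdata`, `probs` and `classes` are names made up for this example.

```r
library(rpart)

df <- data.frame(x = rep(c(FALSE, TRUE), each = 5),
                 y = c(rep(FALSE, 7), rep(TRUE, 3)))

fit <- rpart(y ~ x, data = df, minsplit = 2, method = "class")

# One row for each possible value of the predictor.
newdata <- data.frame(x = c(FALSE, TRUE))

# Class probabilities per row; columns are named after the levels of y.
probs <- predict(fit, newdata, type = "prob")
probs

# Predicted class per row (the level with the highest probability).
classes <- predict(fit, newdata, type = "class")
classes
```

The probabilities match the `(yprob)` column in the printed tree: x = FALSE always gives FALSE, and x = TRUE gives TRUE with probability 0.6.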

Source (Stackoverflow)