Sharp Yan - 8 months ago 54

R Question

I'm currently using decision trees (CART) in R with packages **rpart** and **rattle** for classification.

After training my CART tree, I found that some rules conflict with each other. Consider the following tree, with the conflicting rules indicated by the red circle.

In the parent node the split is CHWC.VLV >= 15; if this is true you go left in the tree and if it is false you go right in the tree. To the left, we find that the child node's rule is CHWC.VLV < 15. However, based on the splitting rule in the parent node, I wouldn't expect any of the observations in this part of the tree to have values CHWC.VLV < 15.

Does anybody know the cause of this apparent conflict?

Answer

This sort of issue generally comes from not displaying enough digits of precision when plotting your CART tree. As a simple example, consider the following dataset:

```
CHWC.VLV <- seq(14, 16, length.out=10000)
outcome <- ifelse(CHWC.VLV >= 14.97, ifelse(CHWC.VLV <= 15.34, 1, 2), 3)
```

We can train and plot our CART model with:

```
library(rpart)
mod <- rpart(outcome~CHWC.VLV)
library(rpart.plot)
prp(mod)
```

The resulting plot appears to show a contradiction, because the left subtree from the root node should contain only observations with `CHWC.VLV >= 15`, but the next split is `CHWC.VLV < 15`. However, upon plotting with more digits of precision, we see that this is, in fact, not a contradiction:

```
prp(mod, digits=4)
```
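To confirm this without relying on the plot, you can read the split points directly from the fitted object. The following is a minimal sketch that rebuilds the example model and inspects the `index` column of `mod$splits`, which stores the numeric threshold of each split; the variable name `split_points` is just for illustration. Rounding those thresholds to two significant digits (which I believe is the default that `prp` uses) collapses both to 15, which is where the apparent conflict comes from.

```r
library(rpart)

# Rebuild the example data and model from above
CHWC.VLV <- seq(14, 16, length.out = 10000)
outcome <- ifelse(CHWC.VLV >= 14.97, ifelse(CHWC.VLV <= 15.34, 1, 2), 3)
mod <- rpart(outcome ~ CHWC.VLV)

# The "index" column of mod$splits holds the threshold used at each split
split_points <- mod$splits[, "index"]
print(split_points, digits = 6)

# Rounded to two significant digits, both thresholds are displayed as 15
signif(split_points, 2)
```

Calling `print(mod)` is another quick check, since the text representation of the tree prints the split values with R's default number of digits.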