I have a purely categorical data frame from the UCI Machine Learning Repository.
I am using rpart to fit a decision tree on a new category, Failed, indicating whether a patient returns within 30 days.
I am using the following call for my decision tree:

tree_model <- rpart(
  Failed ~ race + gender + age + time_in_hospital + medical_specialty +
    num_lab_procedures + num_procedures + num_medications + number_outpatient +
    number_emergency + number_inpatient + number_diagnoses + max_glu_serum +
    A1Cresult + metformin + glimepiride + glipizide + glyburide + pioglitazone +
    rosiglitazone + insulin + change,
  method  = "class",
  data    = training_data,
  control = rpart.control(minsplit = 2, cp = 0.0001, maxdepth = 20, xval = 10),
  parms   = list(split = "gini")
)
CP nsplit rel error xerror xstd
1 0.00065883 0 1.00000 1.0000 0.018518
2 0.00057648 8 0.99424 1.0038 0.018549
3 0.00025621 10 0.99308 1.0031 0.018543
4 0.00020000 13 0.99231 1.0031 0.018543
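The table above is what rpart prints for the fitted model. As a sketch (assuming your fitted object is named `tree_model`, as in the question), you can print it and also access the same values programmatically:

```r
library(rpart)

printcp(tree_model)              # prints the CP table shown above
cp_table <- tree_model$cptable   # the same table as a matrix, with columns
                                 # "CP", "nsplit", "rel error", "xerror", "xstd"
```

Having the table as a matrix is what lets you pick a pruning point in code rather than by eye.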
The xerror column is the cross-validation error (rpart has built-in cross-validation). You use the three columns rel error, xerror and xstd together to help you choose where to prune the tree.
Each row represents a different size of the tree. In general, more splits mean lower classification error on the training data. However, you run the risk of overfitting: the cross-validation error will often start to grow as the tree gains more splits (at least, beyond the 'optimal' size).
A rule of thumb is to choose the lowest level where rel error + xstd < xerror, i.e. the first row where the cross-validation error exceeds the training error by more than one standard error, which is a sign that further splits are fitting noise.
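That rule can be applied directly to the cptable and the result passed to `prune()`. A hedged sketch, assuming the fitted object is called `tree_model` (the fallback to the minimum-xerror row is my own addition for the case where no row satisfies the rule, as in the table above):

```r
library(rpart)

cp_table <- tree_model$cptable

# Rows where CV error exceeds training error by more than one standard error
over <- cp_table[, "rel error"] + cp_table[, "xstd"] < cp_table[, "xerror"]

best_row <- if (any(over)) {
  which(over)[1]                     # lowest such level, per the rule of thumb
} else {
  which.min(cp_table[, "xerror"])    # fallback: row with minimum CV error
}

pruned_model <- prune(tree_model, cp = cp_table[best_row, "CP"])
```

The pruned tree keeps only the splits up to the chosen CP value; everything below it is collapsed.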
If you run plotcp on your fitted model, it will also show you the optimal place to prune the tree.
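For completeness, the plotcp call is just:

```r
library(rpart)

# Plots xerror against cp/tree size; the dotted horizontal line marks
# min(xerror) + one standard error, a common pruning threshold.
plotcp(tree_model)
```

CP values at or below the dotted line are the usual candidates for pruning.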