I am using PART algorithm in R (via package RWeka) for multi-class classification. Target attribute is time bucket in which an invoice will be paid by customer (like 7-15 days, 15-30 days etc). I am using following code for fitting and predicting from the model :
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE, data= trainingData)
predictedTrainingValues <- predict(fit, trainingData)
In order to prune the tree with PART you need to specify it in the control argument of the function:
There is a complete list of the commands you can pass into the control argument here
I quote some of the options here which are relevant to pruning:
Valid options are:
Set confidence threshold for pruning. (Default: 0.25)
Set minimum number of instances per leaf. (Default: 2)
Use reduced error pruning.
Set number of folds for reduced error pruning. One fold is used as the pruning set. (Default: 3)
Looks like the C argument from above might be of help to you and then maybe R and N and M.
In order to use those in the function do:
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE, data= trainingData, control = Weka_control(R = TRUE, N = 5, M = 100)) #random choices
On a separate note for the accuracy metric:
Comparing the accuracy between the training set and the test set to determine over-fitting is not optimal in my opinion. The model was trained on the training set and therefore you expect it to work better there than the test set. A better test is cross-validation. Try performing a 10-fold cross-validation first (you could use caret's function train) and then compare the average cross-validation accuracy to your test set's accuracy. I think this will better. If you do not know what cross-validation is, in general it splits your training set into smaller training and tests sets and trains on the training and tests on the test set. Can read more about it here.