yiyisue - 3 months ago
R Question

Decision Tree party package prediction error - Levels do not match

I am building a CART regression tree model in R using the party package, but I get an error message saying the levels do not match when I try to apply the model to the testing dataset.

I have spent the past week reading through the threads on the forum, but still couldn't find the right solution to my problem, so I am reposting this question here using fake examples I made up. Can someone help explain the error message and provide a solution?

My training dataset has about 1000 records and my testing dataset has about 150. There are no NA or blank fields in either dataset.

My CART model, built with ctree from the party package, is:


mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)


data_train example:

Rate  Bank  Product  Salary
1.5   A     aaa      100000
0.6   B     abc      60000
3     C     bac      10000
2.1   D     cba      50000
1.1   E     cca      80000


data_test example:

Rate  Bank  Product  Salary
2.0   A     cba      80000
0.5   D     cca      250000
0.8   E     cba      120000
2.1   C     abc      65000

levels(data_train$Bank): A, B, C, D, E

levels(data_test$Bank): A, D, E, C


I tried to give the factors in both datasets the same levels using the following code:

> is.factor(data_test$Bank)

TRUE
(I made sure Bank and Product are factors in both datasets.)
> levels(data_test$Bank) <- union(levels(data_test$Bank), levels(data_train$Bank))

> levels(data_test$Product) <- union(levels(data_test$Product), levels(data_train$Product))


However, when I try to run the prediction on the testing dataset, I get the following error:

> fit1 <- predict(mytree, newdata = data_test)

Error in checkData(oldData, RET) :
Levels in factors of new data do not match original data
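
One possible reading of this error, offered as an assumption rather than a confirmed description of party's internals: the check appears to want the level vectors of the new data to match those of the training data exactly, order included. The sketch below uses only base R and the level orders reported above to show that the union() call yields the same set of levels as the training data but a different vector.

```r
# Level orders as reported in the question
train_levels <- c("A", "B", "C", "D", "E")
test_levels  <- c("A", "D", "E", "C")

# union() keeps the order of its first argument, then appends what is new
union_levels <- union(test_levels, train_levels)
print(union_levels)

# Same set of levels, but not the same vector
same_set    <- setequal(union_levels, train_levels)
same_vector <- identical(union_levels, train_levels)
print(c(same_set = same_set, same_vector = same_vector))
```

If that reading is right, adding the missing level "B" to the test factor is not enough on its own; the order of the levels would also have to agree with the training data.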


I have also tried the following method, but it alters the values in my testing dataset:


levels(data_test$Bank) <- levels(data_train$Bank)


The data_test table is altered:

Rate  Bank (altered)  Bank (original)
2.0   A               A
0.5   B               D
0.8   C               E
2.1   D               C
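
The altered column is what one would expect from how assignment to levels() behaves in base R: it renames the existing levels by position and does not look at the values. A minimal sketch, assuming the test factor has the level order A, D, E, C reported above (the variable names are made up for illustration):

```r
# Test factor with the level order reported in the question
bank <- factor(c("A", "D", "E", "C"), levels = c("A", "D", "E", "C"))
train_levels <- c("A", "B", "C", "D", "E")

# levels<- renames levels by position: the 2nd level "D" becomes "B", and so on
relabelled <- bank
levels(relabelled) <- train_levels
print(as.character(relabelled))

# factor() matches by value, so the observations keep their original labels
rebuilt <- factor(as.character(bank), levels = train_levels)
print(as.character(rebuilt))
```

The first result reproduces the altered column in the table above, while the second keeps A, D, E, C and only changes the declared set of levels.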

Answer

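You might try rebuilding the factors in both datasets from one shared set of levels instead of assigning new levels to the existing factors. Assigning to levels() relabels the existing levels by position, whereas factor(x, levels = ...) matches the values by name. Here's an example: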

# start the party
library(party)

# create training data sample
# (stringsAsFactors = TRUE is needed on R >= 4.0, where data.frame()
#  no longer converts strings to factors by default)
data_train <- data.frame(Rate = c(1.5, 0.6, 3, 2.1, 1.1),
                         Bank = c("A", "B", "C", "D", "E"),
                         Product = c("aaa", "abc", "bac", "cba", "cca"),
                         Salary = c(100000, 60000, 10000, 50000, 80000),
                         stringsAsFactors = TRUE)

# create testing data sample
data_test <- data.frame(Rate = c(2.0, 0.5, 0.8, 2.1),
                        Bank = c("A", "D", "E", "C"),
                        Product = c("cba", "cca", "cba", "abc"),
                        Salary = c(80000, 250000, 120000, 65000),
                        stringsAsFactors = TRUE)

# get the union of levels between train and test for Bank and Product
bank_levels <- union(levels(data_test$Bank), levels(data_train$Bank))
product_levels <- union(levels(data_test$Product), levels(data_train$Product))

# rebuild Bank with union of levels
data_test$Bank <- with(data_test, factor(Bank, levels = bank_levels)) 
data_train$Bank <- with(data_train, factor(Bank, levels = bank_levels)) 

# rebuild Product with union of levels
data_test$Product <- with(data_test, factor(Product, levels = product_levels)) 
data_train$Product <- with(data_train, factor(Product, levels = product_levels)) 

# fit the model
mytree <- ctree(Rate ~ Bank + Product + Salary, data = data_train)

# generate predictions
fit1 <- predict(mytree, newdata = data_test)

> fit1
     Rate
[1,] 1.66
[2,] 1.66
[3,] 1.66
[4,] 1.66
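
If refitting the model is not convenient, a variation on the same idea is to rebuild only the test factors so that they carry the training levels. The helper below is a hypothetical sketch in base R (align_levels is not a function from party), and it assumes that every factor column of the training data appears under the same name in the test data.

```r
# Hypothetical helper (not part of party): re-declare every factor column of
# `test` with the level vector of the same-named column in `train`.
align_levels <- function(test, train) {
  for (col in names(train)) {
    if (is.factor(train[[col]]) && col %in% names(test)) {
      test[[col]] <- factor(as.character(test[[col]]),
                            levels = levels(train[[col]]))
    }
  }
  test
}

# Small stand-ins for the Bank column of data_train and data_test
train <- data.frame(Bank = factor(c("A", "B", "C", "D", "E")))
test  <- data.frame(Bank = factor(c("A", "D", "E", "C")))

aligned <- align_levels(test, train)
print(levels(aligned$Bank))

# Values never seen in training would turn into NA, so check before predicting
print(anyNA(aligned$Bank))
```

With something like this, align_levels(data_test, data_train) could be passed as newdata to predict() for a tree fitted on the untouched training data. Any value that appears only in the test data becomes NA, which is worth checking for because the model has never seen that level.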