I am currently using SVMs in R (e1071) with linear kernels to attempt to classify a high dimensional data set. It consists of around 300 patients with around 12000 gene activity levels measured for each patient. My goal is to predict patient response (binary: treatment effective or not) to a certain drug based upon these gene activities.
I want to establish the range of cost values to pass to the tune.svm function, and this is where I am running into trouble. My understanding is that the way to do this is to try progressively smaller and larger values until lower and upper bounds on reasonable performance are established. However, no matter how large or small I make the cost, my resulting test error rate is never worse than about 50%. This happens both with my actual data set and with this toy version. If this subset is too small, I can provide a larger chunk of it. Thanks for any advice.
library(e1071)

dat.ex <- read.table("svm_ex.txt", header = TRUE, row.names = 1)
trainingSize <- 20
possibleCosts <- c(10^-50, 10^-25, 10^25, 10^50)
trainingDat <- sample(1:nrow(dat.ex), replace = FALSE, size = trainingSize)
ex.results <- vector()
for (i in 1:length(possibleCosts)) {
  svm.ex <- svm(dat.ex[trainingDat, -1], factor(dat.ex[trainingDat, 1]),
                kernel = "linear", cost = possibleCosts[i],
                type = "C-classification")
  test.ex <- predict(svm.ex, newdata = dat.ex[-trainingDat, -1])
  truth.ex <- table(pred = test.ex, truth = factor(dat.ex[-trainingDat, 1]))
  exTestCorrectRate <- (truth.ex[1, 1] + truth.ex[2, 2]) / (nrow(dat.ex) - trainingSize)
  ex.results[i] <- exTestCorrectRate
}
First, you are trying extreme values of C. You should search a much smaller range of values (say, no larger than 1e10) at much greater resolution (for example, 25 different values across that interval).
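As a minimal sketch of what such a finer grid search could look like with tune.svm, using synthetic data in place of the asker's file (the layout is assumed to match the question: class label in column 1, features in the remaining columns; the exact grid endpoints, 1e-5 to 1e10, are illustrative, not prescribed):

    library(e1071)

    # Synthetic stand-in for dat.ex: binary label in column 1, 10 features.
    set.seed(1)
    n <- 40; p <- 10
    dat.ex <- data.frame(y = rep(0:1, each = n / 2),
                         matrix(rnorm(n * p), n, p))
    trainingDat <- sample(1:nrow(dat.ex), size = 20)

    # 25 cost values spaced evenly on a log scale (endpoints are an assumption).
    costGrid <- 10^seq(-5, 10, length.out = 25)

    tuned <- tune.svm(x = dat.ex[trainingDat, -1],
                      y = factor(dat.ex[trainingDat, 1]),
                      kernel = "linear",
                      type = "C-classification",
                      cost = costGrid)

    tuned$best.parameters$cost  # cost with the lowest cross-validation error

tune.svm performs cross-validation over the grid internally, so you can inspect tuned$performances to see how error varies with cost rather than relying on a single train/test split.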
Second, you have a very small dataset. 20 training vectors in 10 dimensions may be hard to model well.