Manolete Manolete - 21 days ago 8
R Question

Profiling SVM (e1071) in R

I am new to R and SVMs and I am trying to profile

svm
function from
e1071
package. However, I can't find any large dataset that allows me to get a good profiling range of results varying the size of the input data. Does anyone know how to work
svm
out? Which dataset should I use? Any particular parameters to
svm
that makes it work harder?

I copy some commands that I am using to test the performance. Perhaps it is most useful and easier to get what I am trying here:

#loading libraries
library(class)
library(e1071)
#I've been using golubEsets (more examples availables)
library(golubEsets)

#get the data: matrix 7129x38
data(Golub_Train)
n <- exprs(Golub_Train)

#duplicate rows(to make the dataset larger)
n<-rbind(n,n)

#take training samples as a vector
samplelabels <- as.vector(Golub_Train@phenoData@data$ALL.AML)

#calculate svm and profile it
Rprof('svm.out')
svmmodel1 <- svm(x=t(n), y=samplelabels, type='C', kernel="radial", cross=10)
Rprof(NULL)


I keep increasing the dataset duplicating rows and columns but I reached the limit of memory instead of making
svm
works harder...

Answer

In terms of "working SVM out" - what will make SVM work "harder" is a more complex model which is not easily separated, higher dimensionality and a larger, denser dataset.

SVM performance degrades with:

  • Dataset size increases (number of data points)
  • Sparsity decreases (fewer zeros)
  • Dimensionality increases (number of attributes)
  • Non-linear kernels are used (and kernel parameters can make the kernel evaluation more complex)

Varying Parameters

Are there parameters you can change to make SVM take longer. Of course the parameters affect the quality of the solution you will get and may not make any sense to use.

Using C-SVM, varying C will result in different runtimes. (The similar parameter in nu-SVM is nu) If the dataset is reasonably separable, making C smaller will result in a longer runtime because the SVM will allow more training points to become support vectors. If the dataset is not very separable, making C bigger will cause longer run times because you are essentially telling SVM you want a narrow-margin solution which fits tightly to the data and that will take much longer to compute when the data doesn't easily separate.

Often you find when doing a parameter search that there are parameters that will increase computation time with no appreciable increase in accuracy.

The other parameters are kernel parameters and if you vary them to increase the complexity of calculating the kernel then naturally the SVM runtime will increase. The linear kernel is simple and will be the fastest; non-linear kernels will of course take longer. Some parameters may not increase the calculation complexity of the kernel, but will force a much more complex model, which may take SVM much longer to find the optimal solution to.

Datasets to Use:

The UCI Machine Learning Repository is a great source of datasets.

The MNIST handwriting recognition dataset is a good one to use - you can randomly select subsets of the data to create increasingly larger sized datasets. Keep in mind the data at the link contains all digits, SVM is of course binary so you would have to reduce the data to just two digits or do some kind of multi-class SVM.

You can easily generate datasets as well. To generate a linear dataset, randomly select a normal vector to a hyperplane, then generate a datapoint and determine which side of the hyperplane it falls on to label it. Add some randomness to allow points within a certain distance of the hyperplane to sometimes be labeled differently. Increase the complexity by increasing that overlap between classes. Or generate some numbers of clusters of normally distributed points, labeled either 1 or -1, so that the distributions overlap at the edges. The classic non-linear example is a checkerboard. Generate points and label them in a checkerboard pattern. To make it more difficult enlarge the number of squares, increase the dimensions and increase the number of datapoints. You will have to use a non-linear kernel for that of course.

Comments