I use the CLARA algorithm of Kaufman and Rousseeuw (the clara() function in R's cluster package) to cluster a large dataset with N > 8*10^6 observations. The implementation of the algorithm lets the user control execution time, e.g. by limiting the sample size to sampsize = 100.
However, it seems that plotting the resulting clara object does not scale to a dataset of this size:
library(cluster)
## generate 2 million objects, divided into 2 clusters
x <- rbind(cbind(rnorm(10^6, 0, 0.5), rnorm(10^6, 0, 0.5)),
           cbind(rnorm(10^6, 2, 0.5), rnorm(10^6, 2, 0.5)))
# get cluster solution
clara.x <- clara(x, k = 2, sampsize = 100)
# see medoids
clara.x$medoids
# plot the cluster solution
plot(clara.x) # takes a long time; creates a crowded plot
clusplot(clara.x) # did not finish
First off, plot() for clara objects produces two plots, the first of which is identical to the one returned by clusplot(). If the former finished but the latter did not, my guess is that you're just clogging up the plot history. If you save large plots to png files you won't run into this problem; they'll still take a while to render, but they won't interfere with whatever else you're doing.
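As a concrete sketch of the png approach (using a small simulated dataset rather than millions of points, so it runs quickly; the file name and plot dimensions are arbitrary):

```r
library(cluster)

# small reproducible example standing in for the large dataset
set.seed(1)
x <- rbind(cbind(rnorm(1000, 0, 0.5), rnorm(1000, 0, 0.5)),
           cbind(rnorm(1000, 2, 0.5), rnorm(1000, 2, 0.5)))
clara.x <- clara(x, k = 2, sampsize = 100)

# send both plots to files instead of the interactive device;
# the %d in the filename gives each plot page its own file
png("clara_plots%d.png", width = 800, height = 800)
plot(clara.x, ask = FALSE)
dev.off()
```

Both the clusplot and the silhouette plot end up on disk (clara_plots1.png, clara_plots2.png) instead of in the plot history.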
Regarding reducing the number of plotted points: we can do this manually by adjusting the list elements of clara.x. You just have to choose which points you want to plot. Below, I give an example that uses only the samples drawn by the clara method, but if you want to plot more points you can choose them with sample() or similar:
# Manually shrinking clara object
samp <- clara.x$sample
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp) # point medoid indices to samp
# plot the cluster solution
clusplot(clara.x)
One subtlety is that the medoid indices must always be among whichever indices you choose to plot; otherwise the 5th line above won't work. To ensure this for any given samp, add the following after the 2nd line above:
samp <- union(samp, clara.x$i.med)
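Putting the pieces together with sample() instead of the built-in clara samples (a sketch on small simulated data; the subset size of 500 is arbitrary):

```r
library(cluster)

# small reproducible stand-in for the large dataset
set.seed(1)
x <- rbind(cbind(rnorm(1000, 0, 0.5), rnorm(1000, 0, 0.5)),
           cbind(rnorm(1000, 2, 0.5), rnorm(1000, 2, 0.5)))
clara.x <- clara(x, k = 2, sampsize = 100)

# choose a random subset of points to plot, then make sure
# the medoids are among them so the index remapping below works
samp <- sample(nrow(clara.x$data), 500)
samp <- union(samp, clara.x$i.med)

# shrink the clara object to just those points
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp) # remap medoid indices into samp

clusplot(clara.x)
```

The clustering itself is untouched; only the points handed to clusplot() change.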
ADDENDUM: I just saw the other answer, which takes a different approach: re-computing the clustering on a subsample. A benefit of my approach is that it preserves the original clustering computation and only adjusts which points you plot.