joaoal - 28 days ago 7

R Question

I use the CLARA algorithm from Kaufman and Rousseeuw to cluster a large dataset with **N > 8*10^6** in R. The implementation of the algorithm itself allows the user to control execution time by, e.g., limiting the sample size to **n=100**.

However, it seems that the use of the `plot()` function is not limited in the same way. In theory it should be possible to plot only the best sample from `CLARA` instead of all `N` objects.

    ## generate 2.5 mio objects, divided into 2 clusters.
    x <- rbind(cbind(rnorm(10^6, 0, 0.5), rnorm(10^6, 0, 0.5)),
               cbind(rnorm(1.5*10^6, 5, 0.5), rnorm(1.5*10^6, 5, 0.5)))

    library("cluster")

    # get clusters solution
    clara.x <- clara(x, k = 2, sampsize = 100)

    # see medoids
    clara.x$medoids

    # plot the cluster solution
    plot(clara.x)     # takes long time. creates crowded plot
    clusplot(clara.x) # did not finish

Answer Source

First off, it seems that `plot()` for clara objects produces two plots, the first being identical to the one returned by `clusplot()`. If the former finished but the latter did not, I'm guessing that's just because you're clogging up the plot history. If you save large plots to PNG you won't run into this problem. They'll still take a while, but they won't interfere with whatever else you're doing.
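As a rough sketch of the PNG suggestion, the plot can be sent to a file device rather than the interactive one. The small stand-in dataset, the object names `x_small` and `clara_small`, and the file name `clara_clusplot.png` are assumptions made for illustration, not part of the question:

```r
library(cluster)

set.seed(1)
# Small stand-in for the real data: two Gaussian clusters (sizes are illustrative)
x_small <- rbind(cbind(rnorm(1000, 0, 0.5), rnorm(1000, 0, 0.5)),
                 cbind(rnorm(1500, 5, 0.5), rnorm(1500, 5, 0.5)))
clara_small <- clara(x_small, k = 2, sampsize = 100)

# Open a PNG file device, draw the plot into it, then close the device
out_file <- file.path(tempdir(), "clara_clusplot.png")  # hypothetical file name
png(out_file, width = 800, height = 800)
clusplot(clara_small)
dev.off()
```

The plot is written to `out_file` and nothing is added to the interactive plot history. Drawing every point still takes time on the full dataset.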

Regarding reducing the number of plotted points, we can do this manually by adjusting the list elements of `clara.x`. You just have to choose which points you want to plot. Below, I give an example where I just use the samples from the `clara` method. But if you want to plot more you can choose with `sample()` or something:

```
# Manually shrinking clara object
samp <- clara.x$sample
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp) # point medoid indx to samp
# plot the cluster solution
clusplot(clara.x)
```

One subtlety is that the medoids must always be among whatever indices you choose to plot, otherwise the 5th line above won't work. To ensure this for any given `samp`, add the following after the 2nd line above:

```
samp <- union(samp, clara.x$i.med)
```
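The two steps above (restricting the object to a chosen subset and making sure the medoids are included) could be wrapped up roughly as follows. This is only a sketch: `shrink_clara` is a hypothetical helper, not a function from the `cluster` package, and it assumes `keep` holds integer row indices and that the object was fitted with the default `keep.data = TRUE`. The small dataset is again a stand-in.

```r
library(cluster)

# Hypothetical helper: keep only the rows in `keep` (plus the medoids) for plotting.
# Assumes `keep` contains integer row indices into obj$data.
shrink_clara <- function(obj, keep) {
  keep <- union(keep, obj$i.med)              # medoids must be among the kept rows
  obj$data <- obj$data[keep, , drop = FALSE]
  obj$clustering <- obj$clustering[keep]
  obj$i.med <- match(obj$i.med, keep)         # re-point medoid indices into `keep`
  obj
}

set.seed(1)
# Small stand-in for the real data (sizes are illustrative)
x_small <- rbind(cbind(rnorm(1000, 0, 0.5), rnorm(1000, 0, 0.5)),
                 cbind(rnorm(1500, 5, 0.5), rnorm(1500, 5, 0.5)))
clara_small <- clara(x_small, k = 2, sampsize = 100)

# Plot a random subset of 500 rows instead of all rows
keep <- sample(nrow(clara_small$data), 500)
clara_plot <- shrink_clara(clara_small, keep)
clusplot(clara_plot)
```

Returning a modified copy leaves the original fitted object untouched, so the full clustering stays available for anything other than plotting.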

**ADDENDUM:** Just saw the 1st answer, which is different from mine. It suggests re-computing the clustering. A benefit of my approach is that it keeps the original clustering computation and only adjusts which points you plot.