joaoal - 5 months ago
R Question

How to create cluster plots for large datasets in R

I use the CLARA algorithm from Kaufman and Rousseeuw to cluster a large dataset with N > 8*10^6 in R. The implementation of the algorithm lets the user control execution time, for example by limiting the sample size to n = 100.

However, it seems that the plot function in R includes all data objects in the plot, which results in a very long processing time and very crowded plots (see the reproducible example below).

In theory it should be possible to plot only the best sample found by clara instead of the full dataset. Is there an implementation for this, or how can I work around this issue?

library(cluster)

## generate 2.5 million objects, divided into 2 clusters
x <- rbind(cbind(rnorm(10^6, 0, 0.5), rnorm(10^6, 0, 0.5)),
           cbind(rnorm(1.5*10^6, 5, 0.5), rnorm(1.5*10^6, 5, 0.5)))

# get cluster solution
clara.x <- clara(x, k = 2, sampsize = 100)
# see medoids
clara.x$medoids

# plot the cluster solution
plot(clara.x)     # takes a long time, creates a crowded plot
clusplot(clara.x) # did not finish


Answer Source

First off, it seems like plot() for clara objects gives two plots, the first being identical to that returned by clusplot(). If the former finished but the latter did not, I'm guessing that's just because you're clogging up the plot history. If you save large plots to png you won't run into this problem. They'll still take a while, but it won't interfere with whatever else it is you're doing.
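As a minimal sketch of the "save to png" suggestion, the snippet below writes the cluster plot to a file instead of the interactive plot window. The smaller dataset (2 x 5000 points) and the file name `clara_clusplot.png` are assumptions made here to keep the example quick; they are not from the question.

```r
library(cluster)

set.seed(1)
# Small stand-in for the question's data (assumption: 2 x 5000 points)
x <- rbind(cbind(rnorm(5000, 0, 0.5), rnorm(5000, 0, 0.5)),
           cbind(rnorm(5000, 5, 0.5), rnorm(5000, 5, 0.5)))
clara.x <- clara(x, k = 2, sampsize = 100)

# Hypothetical output path; any writable location would do
out_file <- file.path(tempdir(), "clara_clusplot.png")

png(out_file, width = 1200, height = 900)  # open a png graphics device
clusplot(clara.x)                          # draw into the file, not the plot window
dev.off()                                  # close the device so the file is written
```

With the full 2.5 million points the drawing itself will presumably still be slow, but the result does not accumulate in the plot history.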

Regarding reducing the number of plotted points, we can do this manually by adjusting the list elements of clara.x. You just have to choose which points you want to plot. Below, I give an example where I use only the best sample from the clara method. If you want to plot more points, you can instead draw a larger random set of row indices with sample():

# Manually shrinking clara object
samp <- clara.x$sample
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp) # point medoid indx to samp

# plot the cluster solution
clusplot(clara.x)
One subtlety is that the medoids must always be among the indices you choose to plot; otherwise the 5th line above (the match() call) returns NA for the missing medoids. To ensure this for any given samp, add the following after the 2nd line above:

samp <- union(samp, clara.x$i.med)
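The steps above can be wrapped into a small helper that keeps a random subset of any chosen size and always includes the medoids. This is only a sketch: the function name `shrink_clara`, its argument `n`, and the reduced dataset size are hypothetical choices made here, and it assumes the clara object was fitted with the default `keep.data = TRUE` so that `$data` is available.

```r
library(cluster)

# Hypothetical helper: keep `n` randomly chosen observations plus the medoids.
# Works on a copy, so the original clara object stays untouched.
shrink_clara <- function(obj, n) {
  n    <- min(n, nrow(obj$data))
  samp <- sample(nrow(obj$data), n)      # random row indices to plot
  samp <- union(samp, obj$i.med)         # medoids must be among them
  obj$data       <- obj$data[samp, , drop = FALSE]
  obj$clustering <- obj$clustering[samp]
  obj$i.med      <- match(obj$i.med, samp)  # re-point medoid indices into samp
  obj
}

set.seed(1)
# Small stand-in for the question's data (assumption: 2 x 5000 points)
x <- rbind(cbind(rnorm(5000, 0, 0.5), rnorm(5000, 0, 0.5)),
           cbind(rnorm(5000, 5, 0.5), rnorm(5000, 5, 0.5)))
clara.x <- clara(x, k = 2, sampsize = 100)

small <- shrink_clara(clara.x, 500)  # 500 random points, plus medoids if not drawn
clusplot(small)                      # plots only the reduced set
```

Because the helper returns a modified copy, `clara.x` can still be used for the full plot or shrunk again with a different `n`.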

ADDENDUM: I just saw the 1st answer, which is different from mine. It suggests recomputing the clustering. A benefit of my approach is that it keeps the original clustering computation and only adjusts which points you plot.
