Pan - 3 years ago 275

R Question

I use the following `tsclust` call to cluster my data:

```
SURFSKINTEMP_CLUST <- tsclust(SURFSKINTEMP, k = 10L:20L,
                              distance = "dtw_basic", centroid = "dba",
                              trace = TRUE, seed = 938,
                              norm = "L2", window.size = 2L,
                              args = tsclust_args(cent = list(trace = TRUE)))
```

`SURFSKINTEMP` is very large:

```
> str(SURFSKINTEMP)
List of 327239
 $ V1 : num [1:7] 0.13 0.631 -0.178 0.731 0.86 ...
 $ V2 : num [1:6] 0.117 -0.693 -0.911 -0.911 -0.781 ...
 $ V3 : num [1:7] 0.117 -0.693 -0.911 -0.911 -0.781 ...
 $ V4 : num [1:6] -0.693 -0.911 -0.911 -0.781 -0.604 ...
```

Next, I want to use `cvi` to find the optimal number of clusters "k":

```
names(SURFSKINTEMP_CLUST) <- paste0("k_", 10L:20L)
sapply(SURFSKINTEMP_CLUST, cvi, type = "internal")
```

But this produces an error:

```
> sapply(SURFSKINTEMP_CLUST, cvi, type = "internal")
Error: cannot allocate vector of size 797.8 Gb
```

How can I evaluate the optimum number of clusters “k” in my case?


Answer Source

Specifying `type = "internal"` will try to calculate 7 indices: Silhouette, Dunn, COP, DB, DB*, CH and SF. As mentioned in the documentation for `cvi`, the first 3 will try to calculate the whole cross-distance matrix, which in your case would be a `327,239 x 327,239` matrix; you're going to have a hard time finding a computer that can allocate that, and it would take a *long* time to compute.
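The size in the error message can be checked with a quick back-of-the-envelope calculation. This sketch assumes the cross-distance matrix is stored densely as 8-byte doubles, which is an assumption about the storage, although it agrees with the reported figure:

```r
n <- 327239              # number of series in SURFSKINTEMP
bytes <- n^2 * 8         # dense n x n matrix of doubles (assumed layout)
gib <- bytes / 1024^3    # R reports "Gb" in units of 1024^3 bytes
round(gib, 1)            # 797.8, the size in the error message
```

Any index that needs the full matrix will hit the same limit regardless of `k`, because the matrix size depends only on the number of series.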

Since you're using DBA for centroids, you could see if DB or DB* make sense for your application:

```
sapply(SURFSKINTEMP_CLUST, cvi, type = c("DB", "DBstar"))
```

You could also look at the somewhat simple elbow method, bearing in mind that you could calculate the sum of squared errors (SSE) with the following (see the documentation for `TSClusters-class`):

```
sapply(SURFSKINTEMP_CLUST, function(cl) { sum(cl@cldist ^ 2) })
```
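Once you have one SSE value per `k`, one rough way to pick the elbow is to look for the sharpest bend in the curve. The sketch below is only a heuristic, not something provided by `dtwclust`: `find_elbow` is a hypothetical helper, and the demo values are made up purely for illustration.

```r
# Hypothetical helper: take the elbow to be the k with the largest
# second difference of the SSE curve (the sharpest bend).
find_elbow <- function(k, sse) {
  stopifnot(length(k) == length(sse), length(k) >= 3)
  second_diff <- diff(sse, differences = 2)
  # second_diff[i] is centred on position i + 1 of the original vector
  k[which.max(second_diff) + 1L]
}

# Made-up SSE values, for illustration only
k_demo <- 2:7
sse_demo <- c(100, 60, 30, 25, 22, 20)
elbow_k <- find_elbow(k_demo, sse_demo)
elbow_k
```

With your data you would pass `10:20` and the vector returned by the `sapply` call above. It is still worth plotting SSE against `k`, since the curve may not have a clear elbow.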
