nathanbeagle - 3 months ago 53

R Question

I am trying to find a method to cluster univariate data by group. For example, in the data below I have two failure codes (a and b) and 6 data points for each grouping. In the plot you can see that for each failure code there are 2 distinct clusters for failure time. Manually this isn't bad, but I can't figure out how to do this with a larger data set (~100K rows and ~30 codes). I would like for the end result to give me the medoid for each cluster and the count of codes in that cluster.

`library(ggplot2)`

failure <- rep(c("a","b"),each=6)

ttf <- c(1,1.5,2,5,5.5,6,8,8.5,9,14,14.5,15)

data <- data.frame(failure,ttf)

qplot(failure, ttf)

results <- data.frame(failure = c("a","b"), m1 = c(1.5,8.5), m2 = c(5.5,14.5))

I would like for the end result to give me something like the table below.

`failure m1 m1count m2 m2count`

a 1.5 3 5.5 3

b 8.5 3 14.5 3

Answer

This is will do what you want, assuming only two clusters per failure group, though you could change it in the `tapply`

it would apply to all failure groups.

```
res2 <- tapply(data$ttf, INDEX = data$failure, function(x) kmeans(x,2))
res3 <- lapply(names(res2), function(x) data.frame(failure=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))
res3 <- do.call(rbind, res3)
res3
failure Centers Size
1 a 5.5 3
2 a 1.5 3
11 b 14.5 3
21 b 8.5 3
```