nathanbeagle nathanbeagle - 25 days ago 19
R Question

R Univariate Clustering by Group

I am trying to find a method to cluster univariate data by group. For example, in the data below I have two failure codes (a and b) and 6 data points for each grouping. In the plot you can see that for each failure code there are 2 distinct clusters for failure time. Manually this isn't bad, but I can't figure out how to do this with a larger data set (~100K rows and ~30 codes). I would like for the end result to give me the medoid for each cluster and the count of codes in that cluster.

library(ggplot2)
failure <- rep(c("a","b"),each=6)
ttf <- c(1,1.5,2,5,5.5,6,8,8.5,9,14,14.5,15)
data <- data.frame(failure,ttf)
qplot(failure, ttf)
results <- data.frame(failure = c("a","b"), m1 = c(1.5,8.5), m2 = c(5.5,14.5))


enter image description here

I would like for the end result to give me something like the table below.

failure m1 m1count m2 m2count
a 1.5 3 5.5 3
b 8.5 3 14.5 3

Answer

This is will do what you want, assuming only two clusters per failure group, though you could change it in the tapply it would apply to all failure groups.

res2 <- tapply(data$ttf, INDEX = data$failure, function(x) kmeans(x,2))    
res3 <- lapply(names(res2), function(x) data.frame(failure=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))     
res3 <- do.call(rbind, res3)

res3
   failure Centers Size
1        a     5.5    3
2        a     1.5    3
11       b    14.5    3
21       b     8.5    3
Comments