I am trying to find a method to cluster univariate data by group. For example, in the data below I have two failure codes (a and b) and 6 data points for each grouping. In the plot you can see that for each failure code there are 2 distinct clusters for failure time. Manually this isn't bad, but I can't figure out how to do this with a larger data set (~100K rows and ~30 codes). I would like for the end result to give me the medoid for each cluster and the count of codes in that cluster.
failure <- rep(c("a","b"),each=6)
ttf <- c(1,1.5,2,5,5.5,6,8,8.5,9,14,14.5,15)
data <- data.frame(failure,ttf)
results <- data.frame(failure = c("a","b"), m1 = c(1.5,8.5), m2 = c(5.5,14.5))
failure  m1   m1count  m2    m2count
a        1.5  3        5.5   3
b        8.5  3        14.5  3
This will do what you want, assuming only two clusters per failure group. You could change that number in the tapply call, but it would then apply to all failure groups.
res2 <- tapply(data$ttf, INDEX = data$failure, function(x) kmeans(x,2))
res3 <- lapply(names(res2), function(x) data.frame(failure=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))
res3 <- do.call(rbind, res3)
res3
   failure Centers Size
1        a     5.5    3
2        a     1.5    3
11       b    14.5    3
21       b     8.5    3
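Note that kmeans reports cluster means rather than medoids. The two happen to coincide on this sample, but in general they need not. If actual medoids are required, one possible variation is to swap in pam() from the cluster package. This is a minimal sketch under the same assumption of a fixed number of clusters per group; the helper name medoids_by_group and its k argument are made up for illustration and are not part of any package.

```r
library(cluster)  # provides pam(), partitioning around medoids

# Sample data from the question
failure <- rep(c("a", "b"), each = 6)
ttf <- c(1, 1.5, 2, 5, 5.5, 6, 8, 8.5, 9, 14, 14.5, 15)
data <- data.frame(failure, ttf)

# Hypothetical helper: one row per (failure code, cluster) with the
# medoid and the number of observations assigned to that cluster.
medoids_by_group <- function(df, k = 2) {
  groups <- split(df$ttf, df$failure)
  per_group <- lapply(names(groups), function(nm) {
    fit <- pam(groups[[nm]], k)
    med <- as.vector(fit$medoids)
    # Count members per cluster id; ids index the rows of fit$medoids
    size <- as.vector(table(factor(fit$clustering, levels = seq_len(k))))
    ord <- order(med)  # sort clusters by medoid so the output is stable
    data.frame(failure = nm, cluster = seq_len(k),
               medoid = med[ord], count = size[ord],
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, per_group)
  rownames(out) <- NULL
  out
}

res <- medoids_by_group(data, k = 2)
res
```

The result is in long format (one row per cluster). If the wide m1/m1count/m2/m2count layout from the question is needed, it can be reshaped afterwards. For around 100K rows, pam may be slow on large groups since it works from pairwise dissimilarities; clara() from the same package is a sampling-based alternative worth trying in that case.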