nathanbeagle - 8 months ago
R Question

R Univariate Clustering by Group

I am trying to find a method to cluster univariate data by group. For example, in the data below I have two failure codes (a and b) and 6 data points for each code. In the plot you can see that each failure code has 2 distinct clusters of failure times. Doing this by hand is manageable for a small example, but I can't figure out how to do it for a larger data set (~100K rows and ~30 codes). I would like the end result to give me the medoid of each cluster and the number of data points in that cluster.

library(ggplot2)  # needed for qplot()
failure <- rep(c("a","b"), each=6)
ttf <- c(1,1.5,2,5,5.5,6,8,8.5,9,14,14.5,15)
data <- data.frame(failure, ttf)
qplot(failure, ttf)
results <- data.frame(failure = c("a","b"), m1 = c(1.5,8.5), m2 = c(5.5,14.5))

[Plot of ttf against failure code]

I would like for the end result to give me something like the table below.

failure   m1  m1count    m2  m2count
a        1.5        3   5.5        3
b        8.5        3  14.5        3


This will do what you want, assuming only two clusters per failure group. You can change the number of clusters in the kmeans call inside tapply, but the same number would then apply to all failure groups.

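# k-means with 2 centers, run separately on the ttf values of each failure code
res2 <- tapply(data$ttf, INDEX = data$failure, function(x) kmeans(x, 2))
# One data frame per failure code holding the cluster centers and sizes
res3 <- lapply(names(res2), function(x) data.frame(failure=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))
res3 <- do.call(rbind, res3)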

   failure Centers Size
1        a     5.5    3
2        a     1.5    3
11       b    14.5    3
21       b     8.5    3
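
Note that kmeans reports cluster means, not medoids. The two happen to coincide on this sample data, but they will generally differ. If actual medoids are needed, one possible variation is to swap in pam() from the cluster package and lay the result out one row per failure code, as in the table in the question. This is only a sketch: the choice of pam(), the variable k, and the names medoid_rows and wide are my own assumptions, not part of the answer above.

```r
library(cluster)  # provides pam() (partitioning around medoids)

failure <- rep(c("a", "b"), each = 6)
ttf <- c(1, 1.5, 2, 5, 5.5, 6, 8, 8.5, 9, 14, 14.5, 15)
data <- data.frame(failure, ttf)

k <- 2  # assumed number of clusters, shared by every failure code

# Run PAM separately on each failure code and build one wide row per code
medoid_rows <- lapply(split(data$ttf, data$failure), function(x) {
  fit <- pam(matrix(x, ncol = 1), k)
  ord <- order(fit$medoids)                       # sort clusters by medoid value
  med <- as.numeric(fit$medoids)[ord]             # medoid of each cluster
  cnt <- as.integer(fit$clusinfo[, "size"])[ord]  # number of points per cluster
  out <- as.list(c(rbind(med, cnt)))              # interleave: m1, m1count, m2, m2count
  names(out) <- paste0("m", rep(seq_len(k), each = 2), c("", "count"))
  as.data.frame(out)
})

wide <- do.call(rbind, medoid_rows)
wide <- data.frame(failure = names(medoid_rows), wide, row.names = NULL)
wide
```

On the sample data this gives medoids 1.5 and 5.5 for a, and 8.5 and 14.5 for b, each with a count of 3. pam() works from pairwise dissimilarities, so very large groups may be slow or memory-hungry; clara() from the same package is a sampling-based alternative worth trying in that case.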