vchris_ngs - 1 year ago
R Question

Is it the knee or the elbow of the plot that should be used to define the number of clusters?

I am using the elbow method and silhouette to find the optimal number of clusters k for my data. Most packages give 3 with PAM, k-means and CLARA, whether I look at WSS (within-cluster sum of squares) or silhouette. With Hubert analysis I get 2 clusters as the ideal. The only strange thing is that the command below gives me a plot that I find a bit confusing: should I read it as 3 clusters or 4? Any feedback would be appreciated.

code used

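library(factoextra)  # provides fviz_nbclust()
scaled <- scale(df)  # standardise once and reuse
# WSS for k = 1 is the total sum of squares
wss <- (nrow(scaled) - 1) * sum(apply(scaled, 2, var))
# WSS for k = 2..10 (computed by hand; fviz_nbclust() below does not use it)
for (i in 2:10) wss[i] <- sum(kmeans(scaled, centers = i)$withinss)
fviz_nbclust(scaled, kmeans, method = "wss")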


I am also attaching the image so that someone can tell me whether the number of clusters should be 3 or 4. I think it should be 4, since the whole point of WSS is to select the k after which the SSE is more or less flat.

[image: WSS plot produced by fviz_nbclust]

Answer

The basic idea is that a low within-cluster sum of squares (WSS) is a signal of a good model (in terms of error). However, the more clusters you use, the lower this sum of squared errors (SSE) becomes, so the lowest value alone cannot be the criterion.

In simple terms: "when you see that the rate at which the SSE is decreasing (with a higher number of clusters) is slowing down, that would be a good point to freeze the number of clusters".

Hence, it is the elbow you should look at, which in your case is at k = 4, because the decline in SSE slows down after 4.
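To make the "slowing down" idea concrete, here is a minimal sketch that tabulates how much the SSE drops each time one more cluster is added. It is not the asker's data: the numeric columns of the built-in iris dataset stand in for df, and nstart = 25 and the seed are my own choices to make the k-means runs more stable.

```r
# Hypothetical stand-in for the asker's `df`: numeric columns of iris.
set.seed(42)                   # kmeans starts from random centres
scaled <- scale(iris[, 1:4])   # standardise once and reuse

# Total within-cluster sum of squares for k = 1..10.
ks  <- 1:10
wss <- sapply(ks, function(k) {
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss
})

# Drop in WSS gained by going from k - 1 to k clusters.
drops <- -diff(wss)
print(data.frame(k = ks[-1],
                 wss = round(wss[-1], 1),
                 drop = round(drops, 1)))
```

The elbow is the k after which the values in the drop column become small compared with the earlier ones. Calling plot(ks, wss, type = "b") draws the same curve that the elbow plot shows.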

see also: here and here on SO

On Wikipedia there is an excellent overview of how the number of clusters may be determined: here
