vchris_ngs - 1 year ago 67

R Question

I am using the elbow method and the silhouette method to find the optimal number of clusters k for my data. With most approaches (PAM, k-means, CLARA) I get 3 clusters, whether I look at the WSS (within-cluster sum of squares) or the silhouette. With Hubert analysis I get 2 clusters. The only strange thing is that the command below gives me a plot that I find a bit confusing. Should I read it as 3 clusters or 4? Any feedback would be appreciated.

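Code used:

`library(factoextra)  # provides fviz_nbclust()`

`wss <- (nrow(scale(df)) - 1) * sum(apply(scale(df), 2, var))  # WSS for k = 1`

`for (i in 2:10) wss[i] <- sum(kmeans(scale(df), centers = i)$withinss)  # WSS for k = 2..10`

`fviz_nbclust(scale(df), kmeans, method = "wss")  # elbow plot`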

I am also attaching the image so that someone can tell me whether the cluster number here should be 3 or 4. I think it should be 4, since the whole point of WSS is to select the k beyond which the SSE is more or less flat.


Answer Source

The basic idea is that a low within-cluster sum of squares (WSS) signals a good model in terms of error. However, the more clusters you use, the lower this sum of squared errors (SSE) becomes, so a low value on its own cannot pick k.

In simple terms: "when you see that the rate at which the SSE is decreasing (with a higher number of clusters) is slowing down, that would be a good point to freeze the number of clusters".

Hence, you pick the **elbow**, which in your case is at **4**, because the SSE decline slows down after 4.
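To make the "rate of decrease" concrete, here is a minimal sketch in base R that lists how much the WSS falls with each added cluster. Your `df` is not available to me, so the four numeric columns of the built-in `iris` data set stand in for it; that stand-in, the seed, `nstart = 25` and the names `x`, `ks` and `wss_drop` are my assumptions, not part of your code.

```r
# Stand-in data (assumption): the asker's `df` is not available, so the four
# numeric columns of the built-in iris data set are used instead.
x <- scale(iris[, 1:4])

set.seed(42)  # kmeans starts from random centers; fix the seed for repeatability

# Total within-cluster sum of squares for k = 1..10.
# nstart = 25 reruns kmeans from 25 random starts and keeps the best fit,
# which makes the curve less sensitive to an unlucky initialisation.
ks  <- 1:10
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

# How much the WSS falls when going from k - 1 to k clusters.
wss_drop <- -diff(wss)
names(wss_drop) <- paste0(ks[-length(ks)], "->", ks[-1])

print(round(wss, 1))
print(round(wss_drop, 1))

# The elbow is the k after which the drops stop being large.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the printed drops next to the plot usually makes the choice between two neighbouring values of k easier than the plot alone. If the drop into one k is still clearly larger than the drops that follow it, that k is a reasonable elbow.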

See also: here and here on SO.

Wikipedia has an excellent overview of how the number of clusters may be determined: here
