Geraldine - 3 months ago 33

R Question

I am trying to run the package NbClust on my data (100 rows x 130 columns) to determine the number of clusters I should choose, but I keep getting this error if I try to apply it to the full data set:

`> nc <- NbClust(mydata, distance="euclidean", min.nc=2, max.nc=99, method="ward", index="duda")`

[1] "There are only 100 nonmissing observations out of a possible 100 observations."

Error in NbClust(mydata, distance = "euclidean", min.nc = 2, max.nc = 99, :

The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.

When I apply the method to a 100x80 matrix, it does produce the required output (100x100 also gave me an error message, but a different one). However, obviously, I want to apply this method to the whole dataset.

FYI: creating the distance matrix and clustering with Ward's method were both no problem. Both the distance matrix and the dendrogram were produced.

Answer

I am pretty sure I found the cause of this error message, and it is essentially data related. I looked up the original code for the NbClust package and found the error originates in the beginning part of the code:

```
NbClust <- function(data, diss="NULL", distance = "euclidean", min.nc=2, max.nc=15, method = "ward", index = "all", alphaBeale = 0.1)
{
    x <- 0
    min_nc <- min.nc
    max_nc <- max.nc
    jeu1 <- as.matrix(data)
    numberObsBefore <- dim(jeu1)[1]
    jeu <- na.omit(jeu1) # returns the object with incomplete cases removed
    nn <- numberObsAfter <- dim(jeu)[1]
    pp <- dim(jeu)[2]
    TT <- t(jeu)%*%jeu
    sizeEigenTT <- length(eigen(TT)$value)
    eigenValues <- eigen(TT/(nn-1))$value
    for (i in 1:sizeEigenTT)
    {
        if (eigenValues[i] < 0) {
            print(paste("There are only", numberObsAfter,"nonmissing observations out of a possible", numberObsBefore ,"observations."))
            stop("The TSS matrix is indefinite. There must be too many missing values. The index cannot be calculated.")
        }
    }
    # ... (remainder of the function omitted)
```

So, in my case, my matrix produces negative eigenvalues. I double-checked this, and it does: up to about 100 principal submatrices the eigenvalues stay positive, then they start turning negative. So this is a mathematical issue with my matrix: it is not positive definite. That matters for quite a lot of reasons; a really good explanation of causes and possible solutions is given at http://www2.gsu.edu/~mkteer/npdmatri.html. I am now analyzing my data to find out what causes this. So the code is fine: if you get this error message, you probably have to go back to your data.
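To reproduce the check outside of NbClust, here is a minimal sketch that mirrors the eigenvalue test from the excerpt above. It uses simulated data with the same shape as in the question (100 rows x 130 columns), not the original data set, so the exact numbers will differ. One likely contributor in this shape: `t(jeu) %*% jeu` is 130 x 130 but its rank cannot exceed the number of rows (100), so at least 30 eigenvalues are zero in exact arithmetic, and floating-point rounding can push some of them slightly below zero.

```r
# Simulated stand-in for the data in the question (assumption: NOT the original data).
set.seed(1)
mydata <- matrix(rnorm(100 * 130), nrow = 100, ncol = 130)

# Same steps as the start of NbClust
jeu <- na.omit(as.matrix(mydata))
nn  <- nrow(jeu)
TT  <- t(jeu) %*% jeu                     # 130 x 130 cross-product matrix
eigenValues <- eigen(TT / (nn - 1))$values

dim(TT)                                   # size of the matrix being tested
qr(jeu)$rank                              # numerical rank, at most min(100, 130)
sum(abs(eigenValues) < 1e-8)              # eigenvalues that are numerically zero
sum(eigenValues < 0)                      # how many would trip the check in NbClust
```

If the last count is greater than zero, the `eigenValues[i] < 0` test in the excerpt would stop with the "TSS matrix is indefinite" error, even though there are no missing values.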

I would caution against transposing your data, because then you're essentially multiplying the transpose of your transposed data (i.e. the original data) by your transposed data. And original times transposed is NOT the same as transposed times the original!
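A small sketch of why the two products differ, again with a simulated 100 x 130 matrix (the name `X` is just a placeholder for illustration). Based on the excerpt above, passing `t(X)` to NbClust would make it compute `X %*% t(X)` instead of `t(X) %*% X`, and the rows being clustered would then be the original columns.

```r
# Simulated 100 x 130 stand-in (assumption: not the original data)
set.seed(1)
X <- matrix(rnorm(100 * 130), nrow = 100)

A <- t(X) %*% X    # what the excerpt computes for the original data: 130 x 130
B <- X %*% t(X)    # what it would compute if t(X) were passed in:    100 x 100

dim(A)
dim(B)
```

The two matrices do not even have the same dimensions, so transposing changes the problem being solved rather than fixing the error.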