chet chet - 1 year ago 62
R Question

R Clustering 'purity' metric

I am using the fpc package in R to perform cluster validation.

I can use the function cluster.stats() to compare my clustering with an external partitioning and compute several metrics such as the Rand index, entropy, etc.

However, I am looking for a metric called 'purity' or 'cluster accuracy' which is defined in

I am wondering if there is an implementation of this measure in R.


Answer Source

I don't know of an off-the-shelf function, but here is one way you could do it yourself using the equation in your link:

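ClusterPurity <- function(clusters, classes) {
  # Rows of the table are classes, columns are clusters.
  # Take the majority-class count in each cluster, sum, and divide by n.
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}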
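
For reference, here is the formula I take the function to be computing, assuming the link points to the usual definition of purity. The symbols are my own notation, not necessarily the link's: \omega_k is the set of points in cluster k, c_j is the set of points in class j, and N is the total number of points.

```latex
\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right|
```

In the R code, table(classes, clusters) holds the counts |\omega_k \cap c_j|, apply(..., 2, max) takes the maximum over classes within each cluster column, and length(clusters) is N.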

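Here we can test it on some random assignments. Because the cluster labels are drawn independently of the classes, I believe each cluster's majority class should hold about 1/number-of-classes of its points when the classes are equally frequent, so for large n the purity should be close to 1/3: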

> n = 1e6
> classes = sample(3, n, replace=TRUE)
> clusters = sample(5, n, replace=TRUE)
> ClusterPurity(clusters, classes)
[1] 0.334349
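
A small case that can be checked by hand may also help. This is only a sketch: the ten-point data set below is made up for illustration, and the function is repeated so the snippet runs on its own.

```r
# Self-contained copy of the function from the answer.
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

# Hypothetical toy data: 10 points, 3 clusters, 3 classes.
clusters <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3)
classes  <- c("a", "a", "a", "b", "b", "b", "c", "c", "c", "a")

# Contingency table: rows are classes, columns are clusters.
tab <- table(classes, clusters)
print(tab)

# Majority counts per cluster are 3, 2 and 2, so purity = 7 / 10.
purity <- ClusterPurity(clusters, classes)
print(purity)

# A clustering that matches the classes exactly has purity 1.
perfect <- ClusterPurity(classes, classes)
print(perfect)
```

Cluster 1 contains three "a" and one "b", cluster 2 two "b" and one "c", and cluster 3 two "c" and one "a", which gives (3 + 2 + 2) / 10. Note that purity can only go up as the number of clusters grows (one cluster per point gives purity 1), so it is best compared across clusterings with the same number of clusters.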