C Sumlin - 4 months ago

C++ Question

First, this is for a homework project. I'm not looking for the homework solution itself, just help completing the algorithm I'm using to solve the problem.

I am trying to use k-means clustering to cluster a large set (2^6) of arrays. These arrays are unique permutations of the sequence [0, 1, 2, ..., 31]. However, instead of using Euclidean distance, I need to use inversion distance.

My first step in k-means is to choose k=10 random points from the data set. I then calculate the inversion distance from each array in the data set to each of the k random points and assign the array to the nearest one. This gives the initial clustering.
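For concreteness, the assignment step described above could be sketched roughly as follows. This assumes "inversion distance" means the number of element pairs that appear in opposite relative order in the two permutations (the Kendall tau distance); if the assignment defines it differently, only `inversion_distance` would need to change. The names `Perm`, `inversion_distance`, and `assign_to_nearest` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Perm = std::vector<int>;  // assumed: a permutation of 0..n-1

// Assumed definition: number of pairs ordered differently in a and b
// (Kendall tau distance). O(n^2), which is cheap for n = 32.
int inversion_distance(const Perm& a, const Perm& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<int> pos_in_b(n);
    for (std::size_t i = 0; i < n; ++i) pos_in_b[b[i]] = static_cast<int>(i);
    int count = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (pos_in_b[a[i]] > pos_in_b[a[j]]) ++count;  // pair is inverted
    return count;
}

// Assignment step: label each permutation with the index of its nearest center.
// Ties go to the center with the lowest index.
std::vector<std::size_t> assign_to_nearest(const std::vector<Perm>& data,
                                           const std::vector<Perm>& centers) {
    assert(!centers.empty());
    std::vector<std::size_t> labels(data.size(), 0);
    for (std::size_t p = 0; p < data.size(); ++p) {
        int best = inversion_distance(data[p], centers[0]);
        for (std::size_t c = 1; c < centers.size(); ++c) {
            const int d = inversion_distance(data[p], centers[c]);
            if (d < best) { best = d; labels[p] = c; }
        }
    }
    return labels;
}
```

With k=10, `centers` would hold the ten randomly chosen permutations and `labels[p]` the cluster of `data[p]`.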

Now, I cannot figure out how to convert the next step from Euclidean distance to inversion distance. How can I find the center of each of these clusters (in terms of inversion distance) so I can repeat the clustering step?

As a companion question, is Euclidean distance a good approximation for (or equivalent to) inversion distance? I do not believe it is, but I am not sure how to go about proving it.

Thanks to all in advance.

Answer

In general, you *cannot* use k-means with non-Euclidean distances. You can try to run the algorithm with them, but very little can be said about the meaning of convergence when the algorithm terminates.

As you can see in the Wikipedia entry, the Euclidean distance is inherent to the algorithm. It works by alternating between an assignment step and an update step (analogous to the E and M steps of the EM algorithm), and for the Euclidean distance, it can be shown that both steps minimize the same objective function, the within-cluster sum of squared distances. For other distances, despite the code looking the same, this doesn't hold in general.
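To spell out the argument, here is a short sketch of the standard derivation. The symbols $J$, $C_j$, and $\mu_j$ are notation introduced here, not taken from the question.

```latex
% k-means objective: within-cluster sum of squared Euclidean distances
J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2

% Assignment step: with the centers \mu_j fixed, moving each x to its
% nearest center cannot increase J.

% Update step: with the clusters C_j fixed, set the gradient to zero:
\nabla_{\mu_j} J = -2 \sum_{x \in C_j} (x - \mu_j) = 0
\quad\Longrightarrow\quad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x
```

The arithmetic mean is the minimizer only because the distance is squared Euclidean. For a general distance $d$, the mean need not minimize $\sum_{x \in C_j} d(x, \mu_j)$, and for permutations the element-wise mean is usually not even a permutation, so its inversion distance to the cluster members is undefined.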

See also this question in Cross Validated.

If you have a different distance, you should use something else, e.g., hierarchical clustering or k-medoids.
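As a rough illustration of the k-medoids idea, the center of each cluster can be restricted to an actual member of the data set: the member with the smallest total distance to the other members of its cluster. The sketch below shows only that update step, in a simple brute-force form rather than any particular published variant. It again assumes that "inversion distance" means the Kendall tau distance, and the names `Perm`, `inversion_distance`, and `update_medoids` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Perm = std::vector<int>;  // assumed: a permutation of 0..n-1

// Assumed definition: number of pairs ordered differently in a and b.
int inversion_distance(const Perm& a, const Perm& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<int> pos_in_b(n);
    for (std::size_t i = 0; i < n; ++i) pos_in_b[b[i]] = static_cast<int>(i);
    int count = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            if (pos_in_b[a[i]] > pos_in_b[a[j]]) ++count;
    return count;
}

// Update step: for each cluster, pick the member (as an index into data)
// whose summed distance to the other members is smallest.
// labels[i] is the cluster of data[i]; medoids holds the previous medoid
// indices, and a cluster with no members keeps its previous medoid.
std::vector<std::size_t> update_medoids(const std::vector<Perm>& data,
                                        const std::vector<std::size_t>& labels,
                                        std::vector<std::size_t> medoids) {
    assert(labels.size() == data.size());
    std::vector<long long> best(medoids.size(),
                                std::numeric_limits<long long>::max());
    for (std::size_t i = 0; i < data.size(); ++i) {
        const std::size_t c = labels[i];
        assert(c < medoids.size());
        long long total = 0;
        for (std::size_t j = 0; j < data.size(); ++j)
            if (labels[j] == c) total += inversion_distance(data[i], data[j]);
        if (total < best[c]) { best[c] = total; medoids[c] = i; }
    }
    return medoids;
}
```

Alternating this update with reassigning every point to its nearest medoid never increases the total distance from points to their medoids, so the loop can stop once the medoids no longer change. The brute-force update is quadratic in the number of points, which is not a concern for a data set of this size.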