yasar - 7 months ago 100
Python Question

# Distance calculation on matrix using numpy

I am trying to implement a K-means algorithm in Python (I know there is libraries for that, but I want to learn how to implement it myself.) Here is the function I am havin problem with:

``````def AssignPoints(points, centroids):
"""
Takes two arguments:
points is a numpy array such that points.shape = m , n where m is number of examples,
and n is number of dimensions.

centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
k < m should hold.

Returns:
numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
"""

m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
distances = np.hypot(*temp)
return distances.argmin(axis=1)
``````

Purpose of this function, given m points in n dimensional space, and k centroids in n dimensional space, produce a numpy array of (x1 x2 x3 x4 ... xm) where x1 is the index of centroid which is closest to first point. This was working fine, until I tried it with 4 dimensional examples. When I try to put 4 dimensional examples, I get this error:

``````  File "/path/to/the/kmeans.py", line 28, in AssignPoints
distances = np.hypot(*temp)
ValueError: invalid number of arguments
``````

How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?

``````def AssignPoints(points, centroids):
m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
for i in xrange(len(temp)):
temp[i] = temp[i] ** 2
return distances.argmin(axis=1)
``````

Try this:

``````np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
``````

Or:

``````diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
``````

And don't ask what's it doing :D

Edit: nah, just kidding. The broadcasting in the middle (`points[np.newaxis] - centroids[:,np.newaxis]`) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it `diffs`.

Then we do the usual operation to calculate the euclidean distance (square root of the squares of differences): `np.sqrt((diffs ** 2).sum(axis=2))`. We end up with a `(k, m)` matrix where row 0 contain the distances to `centroids[0]`, etc. So, the `.argmin(axis=0)` gives you the result you wanted.

Source (Stackoverflow)