yasar - 1 year ago 167

Python Question

I am trying to implement a K-means algorithm in Python (I know there is libraries for that, but I want to learn how to implement it myself.) Here is the function I am havin problem with:

`def AssignPoints(points, centroids):`

"""

Takes two arguments:

points is a numpy array such that points.shape = m , n where m is number of examples,

and n is number of dimensions.

centroids is numpy array such that centroids.shape = k , n where k is number of centroids.

k < m should hold.

Returns:

numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.

"""

m ,n = points.shape

temp = []

for i in xrange(n):

temp.append(np.subtract.outer(points[:,i],centroids[:,i]))

distances = np.hypot(*temp)

return distances.argmin(axis=1)

Purpose of this function, given m points in n dimensional space, and k centroids in n dimensional space, produce a numpy array of (x1 x2 x3 x4 ... xm) where x1 is the index of centroid which is closest to first point. This was working fine, until I tried it with 4 dimensional examples. When I try to put 4 dimensional examples, I get this error:

`File "/path/to/the/kmeans.py", line 28, in AssignPoints`

distances = np.hypot(*temp)

ValueError: invalid number of arguments

How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?

`def AssignPoints(points, centroids):`

m ,n = points.shape

temp = []

for i in xrange(n):

temp.append(np.subtract.outer(points[:,i],centroids[:,i]))

for i in xrange(len(temp)):

temp[i] = temp[i] ** 2

distances = np.add.reduce(temp) ** 0.5

return distances.argmin(axis=1)

Recommended for you: Get network issues from **WhatsUp Gold**. **Not end users.**

Answer Source

Try this:

```
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
```

Or:

```
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
```

And don't ask what's it doing :D

Edit: nah, just kidding. The broadcasting in the middle (`points[np.newaxis] - centroids[:,np.newaxis]`

) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it `diffs`

.

Then we do the usual operation to calculate the euclidean distance (square root of the squares of differences): `np.sqrt((diffs ** 2).sum(axis=2))`

. We end up with a `(k, m)`

matrix where row 0 contain the distances to `centroids[0]`

, etc. So, the `.argmin(axis=0)`

gives you the result you wanted.