abhi kafle - 7 months ago 42

Python Question

Here is a simple implementation of k-means clustering (with each plotted point labelled by number):

```
from pylab import plot, show
import matplotlib.pyplot as plt
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2)))

# computing K-Means with K = 2 (2 clusters)
centroids, _ = kmeans(data, 2)

# assign each sample to a cluster
idx, _ = vq(data, centroids)

# ignore this, just labelling each point in cluster
# (`labels` was not defined in the post; numbering the points 1..N is assumed here)
labels = range(1, len(data) + 1)
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        str(label),
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))

# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
```

I am trying to find the indices for all of the points within each cluster.

Answer

In this line:

```
idx,_ = vq(data,centroids)
```

you have already generated a vector containing the index of the nearest centroid for each point (row) in your `data` array.
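To make the shape of that vector concrete, here is a minimal sketch with four made-up points and two hand-picked centroids (both are assumptions for illustration, not values from the question). `vq` returns one centroid index and one distance per row:

```python
import numpy as np
from scipy.cluster.vq import vq

# Four hypothetical points and two hand-picked centroids (not from the question)
points = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 1.1]])
centres = np.array([[0.0, 0.0], [1.0, 1.0]])

# vq returns the nearest-centroid index and the distance to it, one per row
codes, dists = vq(points, centres)
print(codes)  # [0 1 0 1]
```

Rows 0 and 2 are closest to the first centroid and rows 1 and 3 to the second, so `codes` has the same length as `points`.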

It seems you want the row indices of all of the points that are nearest to centroid 0, centroid 1, etc. You can use `np.nonzero` to find the indices where `idx == i`, where `i` is the centroid you are interested in.

For example:

```
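import numpy as np

in_0 = np.nonzero(idx == 0)[0]  # row indices of points nearest to centroid 0
in_1 = np.nonzero(idx == 1)[0]  # row indices of points nearest to centroid 1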
```
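If you have more than two clusters, the same idea can be wrapped in a loop. The sketch below is one possible way to do it, using a small made-up label vector in place of your real `idx`; the name `members` is just an illustrative choice:

```python
import numpy as np

# Hypothetical label vector standing in for the `idx` returned by vq
idx = np.array([0, 1, 1, 0, 2, 1])

# Map each cluster number to the row indices assigned to it
members = {int(i): np.flatnonzero(idx == i) for i in np.unique(idx)}

for cluster, rows in members.items():
    print(cluster, rows)
```

For a 1-D array, `np.flatnonzero(mask)` gives the same result as `np.nonzero(mask)[0]`, so `members[0]` here plays the role of `in_0`.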

In the comments you also asked why the `idx` vector differs across runs. This is because if you pass an integer as the second parameter to `kmeans`, the initial centroids are chosen at random from your observations (see the `scipy.cluster.vq.kmeans` documentation).
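If you need the same assignment on every run, one option is to pass an array of starting centroids instead of an integer, which removes the random initialization. This is only a sketch: the starting values in `init` are hand-picked guesses for data shaped like yours, and the fixed seed is there only so the example data is repeatable.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Fixed seed only so the example data is the same each time
rng = np.random.RandomState(0)
data = np.vstack((rng.rand(150, 2) + 0.5, rng.rand(150, 2)))

# Hand-picked starting centroids (an assumption for illustration)
init = np.array([[1.0, 1.0], [0.5, 0.5]])

# Passing a k-by-N array as the second argument sets the initial centroids
centroids_a, _ = kmeans(data, init)
centroids_b, _ = kmeans(data, init)

idx_a, _ = vq(data, centroids_a)
idx_b, _ = vq(data, centroids_b)

# Both runs start from the same centroids, so the assignments should match
print(np.array_equal(idx_a, idx_b))
```

With explicit starting centroids, cluster 0 is always the one that grew out of `init[0]`, so the labels stay stable as well as the grouping.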