abhi kafle - 1 year ago 96
Python Question

# Finding the indices of all points corresponding to a particular centroid using kmeans clustering

Here is a simple implementation of k-means clustering (with each point in the plot labelled by its row number):

``````
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation: two overlapping groups of 150 points each
data = np.vstack((rand(150, 2) + np.array([.5, .5]), rand(150, 2)))

# computing K-Means with K = 2 (2 clusters)
centroids, _ = kmeans(data, 2)
# assign each sample to a cluster
idx, _ = vq(data, centroids)

# ignore this, just labelling each point in cluster
# (`labels` was undefined in the original; row numbers from 1 are assumed)
labels = range(1, len(data) + 1)
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        str(label),
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))

# some plotting using numpy's logical indexing
plt.plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
         data[idx == 1, 0], data[idx == 1, 1], 'or')
plt.plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
plt.show()
``````

I am trying to find the indices for all of the points within each cluster.

In this line:

``````
idx, _ = vq(data, centroids)
``````

you have already generated a vector containing the index of the nearest centroid for each point (row) in your `data` array.

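It seems you want the row indices of all of the points that are nearest to centroid 0, centroid 1, etc. Comparing `idx == i` gives a boolean mask that is true for the points assigned to centroid `i`, and `np.nonzero` converts that mask into row indices. Because `np.nonzero` returns a tuple with one array per dimension, take element `[0]` for a 1-D `idx`.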

For example:

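``````
import numpy as np

in_0 = np.nonzero(idx == 0)[0]  # row indices of points nearest centroid 0
in_1 = np.nonzero(idx == 1)[0]  # row indices of points nearest centroid 1
``````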
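If you have more than two clusters, the same idea can be wrapped in a loop. The following is only a sketch: the `members` dictionary and the seeded sample data are my own illustration, not part of the question, and `np.flatnonzero(mask)` is used as a shorthand for `np.nonzero(mask)[0]`.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Sample data shaped like the question's (seeded here only for illustration).
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + 0.5, rng.random((150, 2))))

centroids, _ = kmeans(data, 2)
idx, _ = vq(data, centroids)

# Map each centroid number to the row indices of the points assigned to it.
# `members` is a hypothetical name chosen for this example.
members = {i: np.flatnonzero(idx == i) for i in range(len(centroids))}

for i, rows in members.items():
    print(f"centroid {i}: {rows.size} points, first rows {rows[:5]}")
```

Indexing back into the data with `data[members[i]]` then gives the coordinates of the points in cluster `i`.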

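In the comments you also asked why the `idx` vector differs across runs. This is because when you pass an integer as the second argument to `kmeans`, the initial centroids are chosen by randomly selecting observations from your data, so both the final centroid locations and the order in which they are numbered can change from run to run (see the `scipy.cluster.vq.kmeans` documentation). Passing a k-by-N array of initial centroids instead of an integer removes that randomness.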
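If you need the assignments to be repeatable, one option is to hand `kmeans` an explicit array of starting centroids rather than an integer. This is a minimal sketch under my own assumptions: the choice of rows 0 and 150 as the initial guess is arbitrary, and the data is generated from a seeded generator purely so the example is self-contained.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Seeded generator so the sample data itself is repeatable (an assumption;
# the question uses the unseeded numpy.random.rand).
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + 0.5, rng.random((150, 2))))

# Hypothetical initial guess: one row taken from each half of the data.
init = data[[0, 150]]

# With an array as the second argument, no random initialization takes place.
centroids_a, _ = kmeans(data, init)
centroids_b, _ = kmeans(data, init)

idx_a, _ = vq(data, centroids_a)
idx_b, _ = vq(data, centroids_b)

# Both runs start from the same centroids, so the labels should agree.
print(np.array_equal(idx_a, idx_b))
```

The quality of the result now depends on the initial guess, so it is worth picking starting points that are plausibly in different clusters.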
