abhi kafle - 1 year ago
Python Question

Finding the indices of all points corresponding to a particular centroid using kmeans clustering

Here is a simple implementation of k-means clustering (with the points in the clusters labelled from 1 to 500):

from pylab import plot,show,annotate
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq

# data generation
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2)))

# computing K-Means with K = 2 (2 clusters)
centroids,_ = kmeans(data,2)
# assign each sample to a cluster
idx,_ = vq(data,centroids)

# ignore this, just labelling each point in cluster
# (labels was not defined in the posted snippet; numbering the points 1..N is an assumption)
labels = [str(i) for i in range(1, len(data) + 1)]
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    annotate(
        label,
        xy = (x, y), xytext = (-20, 20),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

# some plotting using numpy's logical indexing

I am trying to find the indices of all of the points within each cluster.

[image: plot without labels]

Answer Source

In this line:

idx,_ = vq(data,centroids)

you have already generated a vector containing the index of the nearest centroid for each point (row) in your data array.
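As a small illustration of what vq returns, here is a sketch on toy data. The names points, toy_centroids, codes and dists and the coordinate values are made up for this example and are not part of the question's code.

```python
import numpy as np
from scipy.cluster.vq import vq

# Toy data: four points and two hand-picked centroids (illustrative values).
points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
toy_centroids = np.array([[0.0, 0.0], [1.0, 1.0]])

# codes[i] is the row number of the centroid nearest to points[i];
# dists[i] is the distance from points[i] to that centroid.
codes, dists = vq(points, toy_centroids)

print(codes)  # [0 0 1 1]
```

The first two points are closest to centroid 0 and the last two to centroid 1, so the vector has one entry per row of the input.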

It seems you want the row indices of all of the points that are nearest to centroid 0, centroid 1, etc. You can use np.nonzero to find the indices where idx == i, where i is the centroid you are interested in. np.nonzero returns a tuple with one array per dimension, so take element [0] to get the row indices.

For example:

import numpy as np

in_0 = np.nonzero(idx == 0)[0]
in_1 = np.nonzero(idx == 1)[0]
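If you have more than two clusters, the same idea can be wrapped in a loop. The following is a minimal sketch that rebuilds the data from the question; the dictionary name members is hypothetical and only used for illustration.

```python
import numpy as np
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# Same data generation and clustering as in the question.
data = np.vstack((rand(150, 2) + np.array([.5, .5]), rand(150, 2)))
centroids, _ = kmeans(data, 2)
idx, _ = vq(data, centroids)

# Map each centroid number to the row indices of the points assigned to it.
# len(centroids) is used rather than a hard-coded 2, in case kmeans
# returns fewer centroids than requested.
members = {i: np.nonzero(idx == i)[0] for i in range(len(centroids))}

for i, rows in members.items():
    print("centroid", i, "has", len(rows), "points")
```

data[members[i]] then gives the coordinates of the points in cluster i.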

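In the comments you also asked why the idx vector differs across runs. This is because, if you pass an integer as the second parameter to kmeans, the initial centroid locations are chosen at random (see the scipy.cluster.vq.kmeans documentation), so the cluster that is numbered 0 in one run may be numbered 1 in the next.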
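One way to get the same idx on every run is to pass an array of initial centroids instead of an integer, which kmeans accepts as its second parameter. This is only a sketch: the starting values in init are illustrative guesses, and the fixed seed is there solely so that the toy data is repeatable.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Fixed seed so that the generated data is the same on each run (assumption
# made for this example; the question's data is unseeded).
rng = np.random.RandomState(0)
data = np.vstack((rng.rand(150, 2) + np.array([.5, .5]), rng.rand(150, 2)))

# Initial guess for the two centroids (illustrative values).
init = np.array([[0.5, 0.5], [1.0, 1.0]])

# With an explicit initial guess there is no random initialization step.
centroids_a, _ = kmeans(data, init)
centroids_b, _ = kmeans(data, init)

idx_a, _ = vq(data, centroids_a)
idx_b, _ = vq(data, centroids_b)

print(np.array_equal(idx_a, idx_b))
```

Both calls start from the same centroids, so they should produce the same assignments, and the centroid numbering follows the order of the rows in init.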
