sweeeeeet - 3 months ago
Python Question

Clustering data with given cluster centers in Python

I have a 1-dimensional numerical dataset (although my question also applies to an n-dimensional one) that I want to cluster, and I already know the values of the cluster centers. So I only need to map each data point to its associated cluster center, i.e. the one closest to that data point.

I could write an ad hoc function, but I would much prefer to use a scientific Python library optimised for pandas.Series or numpy arrays, such as SciPy, because my dataset is very large (hundreds of millions of data points).

How can I do this?

Thank you!

Answer

You are looking for the vq function from scipy.cluster.vq.

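The first argument is the data to cluster, and the second is the coordinates of the cluster centers. The function returns a tuple: its first element holds, for each data point, the index of the nearest cluster center (the label), which is what you want; its second element holds the distance from each point to that center: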

>>> from numpy import array
>>> from scipy.cluster.vq import vq
>>> vq(array([0, 5, 5]), array([1, 2, 3]))
(array([0, 2, 2]), array([ 1.,  2.,  2.]))
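
Since the question also mentions the n-dimensional case, here is a minimal sketch of the same call on 2-D data. The data and center values are made up for illustration, and the variable names (data, centers, labels, distances, assigned_centers) are my own; only vq itself comes from SciPy. The last line shows one way to go from labels back to the center coordinates, using plain NumPy indexing.

```python
import numpy as np
from scipy.cluster.vq import vq

# Hypothetical 2-D data points and known cluster centers (illustrative values only).
data = np.array([[0.0, 0.0], [0.9, 1.2], [5.1, 4.8], [4.0, 5.5]])
centers = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

# labels[i] is the row index in `centers` of the center nearest to data[i];
# distances[i] is the Euclidean distance from data[i] to that center.
labels, distances = vq(data, centers)

# Indexing `centers` with the labels maps every point to its assigned center.
assigned_centers = centers[labels]
```

If the data lives in a pandas.Series, converting it first with its to_numpy() method should be enough, though I have not measured how this behaves at hundreds of millions of points.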