diugalde - 1 month ago
Python Question

How to calculate the distance between a document and each centroid (k-means)?

I ran scikit-learn's k-means algorithm and got the resulting centroids. I have a new document (one that was not in the initial collection), and I would like to calculate the distance between it and every centroid to find out which cluster it should be placed in.

Is there a built-in function for that, or should I write a similarity function manually?

Answer

You can use the method predict to get the closest cluster for each sample in a matrix X:

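from sklearn.cluster import KMeans

# K is the number of clusters; X_train and X_test are feature matrices
# (for documents, e.g. TF-IDF vectors built with the same fitted vectorizer).
model = KMeans(n_clusters=K)
model.fit(X_train)
# predict returns the index of the nearest centroid for each row of X_test
label = model.predict(X_test)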
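
If you need the actual distances and not just the label, KMeans also has a transform method, which returns the Euclidean distance from each sample to every centroid. Here is a minimal sketch, assuming TF-IDF features; the toy corpus, the variable names, and the parameter values are made up for illustration. Note that the new document has to be vectorized with the same fitted vectorizer as the training documents:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); replace with your own documents.
train_docs = [
    "the cat sat on the mat",
    "cats and kittens chase mice",
    "stock prices fell sharply today",
    "the market rallied as stocks rose",
]
new_doc = ["my cat chased a mouse"]

# Fit the vectorizer on the training documents only, then reuse it
# for the new document so both live in the same feature space.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_new = vectorizer.transform(new_doc)

# n_clusters=2 and random_state=0 are arbitrary choices for this sketch.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X_train)

# transform gives one row per sample and one column per centroid,
# each entry being the Euclidean distance to that centroid.
distances = model.transform(X_new)

# The column with the smallest distance is the closest cluster.
nearest = int(distances.argmin(axis=1)[0])

print("distance to each centroid:", distances[0])
print("closest cluster:", nearest)
```

The distances here are Euclidean, because that is what scikit-learn's KMeans optimizes. If you want a different measure (cosine similarity, say), you would have to compute it yourself against model.cluster_centers_.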