GBR24 - 4 months ago
Python Question

Euclidean distance between elements in two different matrices?

I am trying to determine the Euclidean distance of my documents from their centroids. The dimensions of the two arrays in question (points and centers) satisfy the XA and XB dimensional requirements for scipy.spatial.distance.cdist, but I don't know why I'm getting the ValueError below.

My code:

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = pd.Series(["bye bye brutal good bye apple banana orange", "bye bye hello apple banana", "corn wheat apple banana goodbye cookie brutal", "fruit cake banana apple bye sweet sweet"])
vectorizer = TfidfVectorizer()  # was undefined in the original snippet
X = vectorizer.fit_transform(corpus)  # TF-IDF matrix, one row per document
model = KMeans(n_clusters=2)
model.fit(X)
centers = model.cluster_centers_  # array of shape (n_clusters, n_features)

cdist(X, centers)  # this is the call that raises the error below


This is the error I get:

ValueError: setting an array element with a sequence.


From the scipy.spatial.distance.cdist documentation:

Parameters:
XA: ndarray
    An Ma by n array of Ma original observations in an n-dimensional space
XB: ndarray
    An Mb by n array of Mb original observations in an n-dimensional space
...


My X and centers numpy arrays certainly satisfy these dimensional conditions for cdist, right? What am I missing?

Answer

You only need one small change:

cdist(X.toarray(), centers)

The shapes are fine; the problem is the type. TfidfVectorizer.fit_transform returns a sparse matrix, so X is an object of type scipy.sparse.csr.csr_matrix rather than a numpy array, and cdist does not accept it directly as input. The toarray() method converts it to a regular dense numpy array.
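For anyone who wants to check this without fitting a model, here is a minimal sketch of the same fix. The small matrix and the two centroids are made-up stand-ins for the TF-IDF output and for the fitted model's centers, not the values from the question; only the sparse-to-dense conversion before cdist is the point.

```python
import numpy as np
from scipy import sparse
from scipy.spatial.distance import cdist

# Hypothetical stand-in for the TF-IDF matrix: 4 documents, 3 features.
dense = np.array([[1.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0],
                  [1.0, 1.0, 0.0]])
X = sparse.csr_matrix(dense)

# Hypothetical stand-in for the fitted centroids: 2 clusters, 3 features.
centers = np.array([[0.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0]])

# X has the right shape, but it is sparse, not an ndarray.
print(sparse.issparse(X), X.shape)

# Convert to a dense array before handing it to cdist.
distances = cdist(X.toarray(), centers)

# One row per document, one column per centroid.
print(distances.shape)  # (4, 2)

# Distance from each document to its closest centroid.
nearest = distances.min(axis=1)
```

Keep in mind that toarray() materializes the full dense matrix, so with a large vocabulary this can use a lot of memory.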