GBR24 - 7 months ago 69

Python Question

I am trying to determine the Euclidean distance of my documents from their centroids. The two arrays in question (`points` and `centers`) look like valid `XA` and `XB` arguments for `scipy.spatial.distance.cdist`, but the call raises a `ValueError`.

My code:

```
import pandas as pd, numpy as np
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = pd.Series(["bye bye brutal good bye apple banana orange", "bye bye hello apple banana", "corn wheat apple banana goodbye cookie brutal", "fruit cake banana apple bye sweet sweet"])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

model = KMeans(n_clusters = 2)
model.fit(X)
centers = model.cluster_centers_

cdist(X, centers)  # this call raises the error
```

This is the error I get:

`ValueError: setting an array element with a sequence.`

From the `scipy.spatial.distance.cdist` documentation:

`Parameters:`

`XA`: ndarray. An Ma by n array of Ma original observations in an n-dimensional space

`XB`: ndarray. An Mb by n array of Mb original observations in an n-dimensional space

...

As far as I can tell, my `X` and `centers` are `numpy` arrays of the right shape, so I don't understand why `cdist` fails.

Answer

You only need one small change:

```
cdist(X.toarray(), centers)
```

`X` is an object of type `scipy.sparse.csr.csr_matrix` (the sparse matrix that `TfidfVectorizer.fit_transform` returns), so `cdist` does not accept it directly as input. Its `toarray()` method converts it to a dense numpy array, which is a valid input.
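To make the conversion step concrete, here is a minimal sketch that needs only numpy and scipy. The small matrix `X` and the array `centers` are made-up stand-ins for the TF-IDF output and `model.cluster_centers_`, and the helper name `distances_to_centers` is hypothetical, not part of either library.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from scipy.spatial.distance import cdist

# Made-up stand-in for the TF-IDF matrix: 4 documents, 3 features (sparse).
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [1.0, 1.0, 0.0],
]))

# Made-up stand-in for model.cluster_centers_: 2 centroids, 3 features (dense).
centers = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
])


def distances_to_centers(X, centers):
    """Return an (n_docs, n_centers) array of Euclidean distances.

    Hypothetical helper: densifies X first if it is a scipy sparse matrix,
    because cdist expects plain ndarrays.
    """
    dense = X.toarray() if issparse(X) else np.asarray(X)
    return cdist(dense, centers)  # metric defaults to "euclidean"


dists = distances_to_centers(X, centers)
nearest = dists.argmin(axis=1)  # index of the closest centroid per document
print(dists.shape)
print(nearest)
```

Note that `toarray()` materializes every zero entry, so for a large corpus the dense matrix may not fit in memory; in that case it may be worth converting and measuring the rows in batches.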