jeremy radcliff - 1 year ago 107
Python Question

# Sum of difference of squares between each combination of rows of 17,000 by 300 matrix

Ok, so I have a matrix with 17000 rows (examples) and 300 columns (features). I basically want to compute the Euclidean distance between every possible pair of rows, i.e. the sum of the squared differences for each pair of rows.
Obviously that's a lot of work, and IPython, while not completely crashing my laptop, says "(busy)" for a while and then I can't run anything anymore. It certainly seems to have given up, even though I can still move my mouse and everything.

Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the result for each possible pair of rows in a distance matrix. I'm aware that the lower triangle of the matrix equals the upper triangle, but exploiting that would only save half the computation time (better than nothing, but not a game changer, I think).

EDIT: I just tried using `scipy.spatial.distance.pdist`, but it's been running for a good minute now with no end in sight. Is there a better way? I should also mention that I have NaN values in there, but that's apparently not a problem for numpy.

```
import numpy as np

features = np.array(dataframe)  # 17000 rows x 300 columns
distances = np.zeros((17000, 17000))

def sum_diff():
    # Fill distances[i, j] with the sum of squared differences of rows i and j
    for i in range(17000):
        for j in range(17000):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i, j] = sumsquares
```

You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).
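As a rough sketch of that symmetry trick (not the asker's code, and the function name `pairwise_sq_dists_symmetric` is made up for illustration), one could loop over the rows once, compare each row only against the rows after it in a single vectorized step, and mirror the result into the lower triangle. Note that a full 17000 x 17000 matrix of 64-bit floats needs about 2.3 GB of memory on its own, whichever way it is filled.

```python
import numpy as np

def pairwise_sq_dists_symmetric(features):
    """Sum of squared differences for every pair of rows, using symmetry.

    Hypothetical helper for illustration; assumes a 2-D float array.
    """
    n = features.shape[0]
    distances = np.zeros((n, n))  # diagonal stays 0 because d(i, i) = 0
    for i in range(n - 1):
        # Differences between row i and all later rows, computed at once.
        diff = features[i + 1:] - features[i]
        row = np.sum(diff * diff, axis=1)
        distances[i, i + 1:] = row  # upper triangle
        distances[i + 1:, i] = row  # mirrored: d(j, i) = d(i, j)
    return distances

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]], dtype=float)
print(pairwise_sq_dists_symmetric(a))
```

This replaces the inner Python loop with one NumPy operation per row, which usually matters more for speed than the factor of 2 gained from symmetry.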

But have you had a look at `sklearn.metrics.pairwise.pairwise_distances()` (in v0.18; see the scikit-learn documentation)?

You would use it as:

```
from sklearn.metrics import pairwise
import numpy as np

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
```
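By default `pairwise_distances` returns the Euclidean distance, i.e. the square root of the sum of squared differences. If the plain sum of squared differences from the question is what you want, a minimal sketch (assuming a scikit-learn version that provides `euclidean_distances` with its `squared` parameter, and assuming any NaN values have been imputed or dropped beforehand) could look like this:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]], dtype=float)

# squared=True skips the final square root, giving the sum of
# squared differences for every pair of rows.
sq_dists = euclidean_distances(a, squared=True)
print(sq_dists)
```

For this toy input the off-diagonal entries should come out as 3, 27 and 12, with zeros on the diagonal.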