jeremy radcliff jeremy radcliff - 2 months ago 10
Python Question

Sum of difference of squares between each combination of rows of 17,000 by 300 matrix

Ok, so I have a matrix with 17000 rows (examples) and 300 columns (features). I want to compute basically the euclidian distance between each possible combination of rows, so the sum of the squared differences for each possible pair of rows.
Obviously it's a lot and iPython, while not completely crashing my laptop, says "(busy)" for a while and then I can't run anything anymore and it certain seems to have given up, even though I can move my mouse and everything.

Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the differences in a difference matrix for each possible combination. I'm aware that the lower diagonal part of the matrix = the upper diagonal, but that would only save 1/2 the computation time (better than nothing, but not a game changer, I think).

EDIT: I just tried using

scipy.spatial.distance.pdist
but it's been running for a good minute now with no end in sight, is there a better way? I should also mention that I have NaN values in there...but that's not a problem for numpy apparently.

features = np.array(dataframe)
distances = np.zeros((17000, 17000))


def sum_diff():
for i in range(17000):
for j in range(17000):
diff = np.array(features[i] - features[j])
diff = np.square(diff)
sumsquares = np.sum(diff)
distances[i][j] = sumsquares

Answer

You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).

But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (in v 0.18, see the doc here) ?

You would use it as:

from sklearn.metrics import pairwise
import numpy as np

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
Comments