OK, so I have a matrix with 17000 rows (examples) and 300 columns (features). I want to compute the squared Euclidean distance between every possible pair of rows, i.e. the sum of the squared differences for each pair of rows.
Obviously it's a lot, and IPython, while not completely crashing my laptop, says "(busy)" for a while, and then I can't run anything anymore. It certainly seems to have given up, even though I can still move my mouse and everything.
Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the result for each possible pair of rows in a distance matrix. I'm aware that the lower triangle of the matrix equals the upper triangle, but exploiting that would only save half the computation time (better than nothing, but not a game changer, I think).
EDIT: I just tried using
features = np.array(dataframe)
distances = np.zeros((17000, 17000))
for i in range(17000):
    for j in range(17000):
        diff = np.array(features[i] - features[j])
        diff = np.square(diff)
        sumsquares = np.sum(diff)
        distances[i][j] = sumsquares
You could always halve your computation time by noticing that d(i, i) = 0 and d(i, j) = d(j, i).
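A minimal sketch of that symmetry trick, assuming the data fits in memory as a float64 array. The function name squared_distances_symmetric is made up for illustration. It also replaces the inner Python loop with one NumPy operation per row, which is likely where most of the speedup comes from:

```python
import numpy as np

def squared_distances_symmetric(features):
    """Squared Euclidean distances between all rows, computing each pair once.

    Hypothetical helper for illustration; not part of NumPy or scikit-learn.
    """
    X = np.asarray(features, dtype=np.float64)
    n = X.shape[0]
    distances = np.zeros((n, n))  # the diagonal stays 0 because d(i, i) = 0
    for i in range(n - 1):
        # Differences between row i and every later row, all at once.
        diff = X[i + 1:] - X[i]
        sq = (diff * diff).sum(axis=1)
        distances[i, i + 1:] = sq  # upper triangle
        distances[i + 1:, i] = sq  # mirror, since d(i, j) = d(j, i)
    return distances
```

Note that for 17000 rows the full float64 result alone takes 17000 * 17000 * 8 bytes, roughly 2.3 GB, whichever way it is filled.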
But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (in v0.18, see the doc here)?
You would use it as:
from sklearn.metrics import pairwise
import numpy as np
a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
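Keep in mind that pairwise_distances() returns the Euclidean distance by default, not the sum of squared differences the question asks for. If you want the squared version without extra dependencies, here is a rough NumPy-only sketch. It relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, and the function name squared_distances is just an illustrative choice:

```python
import numpy as np

def squared_distances(features):
    """Squared Euclidean distances between all rows, with no Python loops.

    Hypothetical helper for illustration; uses
    ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y
    """
    X = np.asarray(features, dtype=np.float64)
    sq_norms = (X * X).sum(axis=1)  # ||x||^2 for every row
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(D, 0.0, out=D)  # clip tiny negatives caused by rounding
    np.fill_diagonal(D, 0.0)   # d(i, i) is exactly 0
    return D

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
print(squared_distances(a))
```

The single matrix product does the heavy lifting, so this should be far faster than a double loop, but the identity can lose some precision when rows are nearly identical, which is why the result is clipped at zero.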