jeremy radcliff - 1 year ago 72

Python Question

OK, so I have a matrix with 17000 rows (examples) and 300 columns (features). I basically want to compute the Euclidean distance between every possible pair of rows, so the sum of the squared differences for each pair.

Obviously it's a lot, and IPython, while not completely crashing my laptop, says "(busy)" for a while. After that I can't run anything anymore, and it certainly seems to have given up, even though I can still move my mouse and everything.

Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.

What I'm doing is storing the differences in a distance matrix for each possible pair. I'm aware that the lower triangular part of the matrix equals the upper triangular part, but exploiting that would only save half the computation time (better than nothing, but not a game changer, I think).

**EDIT**: I just tried using `scipy.spatial.distance.pdist`

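```
import numpy as np

# dataframe holds the 17000 x 300 data described above
features = np.array(dataframe)
distances = np.zeros((17000, 17000))

def sum_diff():
    # Visit every (i, j) pair of rows and store the sum of squared differences
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
```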

Answer Source

You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).
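As a rough sketch of that symmetry trick (the function name `squared_distances_upper` and the tiny test array are just made up for illustration), you could compute each row against the rows after it in one NumPy step and mirror the result into the lower triangle. It assumes the full n x n result fits in memory:

```python
import numpy as np

def squared_distances_upper(features):
    """Sum of squared differences for every pair of rows, using symmetry."""
    n = features.shape[0]
    distances = np.zeros((n, n))  # diagonal stays 0 because d(i, i) = 0
    for i in range(n - 1):
        # Compare row i against all later rows in one vectorized step.
        diff = features[i + 1:] - features[i]
        row = np.sum(diff * diff, axis=1)
        distances[i, i + 1:] = row  # upper triangle: d(i, j)
        distances[i + 1:, i] = row  # mirrored lower triangle: d(j, i)
    return distances

# Hypothetical small input, same shape convention as the question (rows = examples)
small = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [3.0, 3.0, 3.0]])
result = squared_distances_upper(small)
```

This also replaces the inner Python loop with array operations, which is probably where most of the speedup would come from, more than the factor of 2 itself.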

But have you had a look at `sklearn.metrics.pairwise.pairwise_distances()` (in v0.18; see the scikit-learn documentation)?

You would use it as:

```
from sklearn.metrics import pairwise
import numpy as np
a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
```
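Note that `pairwise_distances` uses the Euclidean metric by default, so it returns the square root of the sum of squared differences rather than the sum itself. If you want the squared version and prefer to stay in plain NumPy, here is a minimal sketch based on the identity ||x - y||² = ||x||² + ||y||² - 2 x·y. The function name `squared_distances` is mine, not a library function, and I'm assuming the data is numeric and the result fits in RAM:

```python
import numpy as np

def squared_distances(features):
    """Squared Euclidean distance between every pair of rows, without Python loops."""
    features = np.asarray(features, dtype=float)
    sq_norms = np.sum(features * features, axis=1)  # ||x||^2 for each row
    gram = features @ features.T                    # all pairwise dot products
    distances = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    # Floating-point rounding can leave tiny negative values; clip them to 0.
    np.maximum(distances, 0.0, out=distances)
    return distances

a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
d2 = squared_distances(a)
```

Keep in mind that a 17000 x 17000 array of float64 takes about 2.3 GB on its own (17000 * 17000 * 8 bytes), and the intermediate arrays above need additional space of the same order, so memory may be the real bottleneck on a laptop.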