stvn66 - 1 year ago 65
Python Question

Calculate the triangular matrix of distances between NumPy array of coordinates

I have an NumPy array of coordinates. For example purposes, I will use this

``````In [1]: np.random.seed(123)
In [2]: coor = np.random.randint(10, size=12).reshape(-1,3)
In [3]: coor
Out[3]: array([[2, 2, 6],
[1, 3, 9],
[6, 1, 0],
[1, 9, 0]])
``````

I want the triangular matrix of distances between all coordinates. A simple approach would be to code a double loop over all coordinates

``````In [4]: n_coor = len(coor)
In [5]: dist = np.zeros((n_coor, n_coor))
In [6]: for j in xrange(n_coor):
for k in xrange(j+1, n_coor):
dist[j, k] = np.sqrt(np.sum((coor[j] - coor[k]) ** 2))
``````

with the result being an upper triangular matrix of the distances

``````In [7]: dist
Out[7]: array([[  0.        ,   3.31662479,   7.28010989,   9.2736185 ],
[  0.        ,   0.        ,  10.48808848,  10.81665383],
[  0.        ,   0.        ,   0.        ,   9.43398113],
[  0.        ,   0.        ,   0.        ,   0.        ]])
``````

Leveraging NumPy, I can avoid looping using

``````In [8]: dist = np.sqrt(((coor[:, None, :] - coor) ** 2).sum(-1))
``````

but the result is the entire matrix

``````In [9]: dist
Out[9]: array([[  0.        ,   3.31662479,   7.28010989,   9.2736185 ],
[  3.31662479,   0.        ,  10.48808848,  10.81665383],
[  7.28010989,  10.48808848,   0.        ,   9.43398113],
[  9.2736185 ,  10.81665383,   9.43398113,   0.        ]])
``````

This one line version takes roughly half the time when I use 2048 coordinates (4 s instead of 10 s) but this is doing twice as many calculations as it needs in order to get the symmetric matrix. Is there a way to adjust the one line command to only get the triangular matrix (and the additional 2x speedup, i.e. 2 s)?

Answer Source

We can use `Scipy's` pdist to get those distances. So, we just need to initialize the output array and then set the upper triangular places with those distances, like so -

``````from scipy.spatial.distance import pdist

n_coor = len(coor)
dist = np.zeros((n_coor, n_coor))
row,col = np.triu_indices(n_coor,1)
dist[row,col] = pdist(coor)
``````

Alternatively, we can use `boolean-indexing` to assign values and thus replace the last two lines, like so -

``````dist[np.arange(n_coor)[:,None] < np.arange(n_coor)] = pdist(coor)
``````

Runtime test

Functions -

``````def subscripted_indexing(coor):
n_coor = len(coor)
dist = np.zeros((n_coor, n_coor))
row,col = np.triu_indices(n_coor,1)
dist[row,col] = pdist(coor)
return dist

def boolean_indexing(coor):
n_coor = len(coor)
dist = np.zeros((n_coor, n_coor))
r = np.arange(n_coor)
dist[r[:,None] < r] = pdist(coor)
return dist
``````

Timings -

``````In [110]: # Setup input array
...: coor = np.random.randint(0,10, (2048,3))

In [111]: %timeit subscripted_indexing(coor)
10 loops, best of 3: 91.4 ms per loop

In [112]: %timeit boolean_indexing(coor)
10 loops, best of 3: 47.8 ms per loop
``````