stvn66 stvn66 - 1 month ago 16
Python Question

Calculate the triangular matrix of distances between NumPy array of coordinates

I have an NumPy array of coordinates. For example purposes, I will use this

In [1]: np.random.seed(123)
In [2]: coor = np.random.randint(10, size=12).reshape(-1,3)
In [3]: coor
Out[3]: array([[2, 2, 6],
[1, 3, 9],
[6, 1, 0],
[1, 9, 0]])


I want the triangular matrix of distances between all coordinates. A simple approach would be to code a double loop over all coordinates

In [4]: n_coor = len(coor)
In [5]: dist = np.zeros((n_coor, n_coor))
In [6]: for j in xrange(n_coor):
for k in xrange(j+1, n_coor):
dist[j, k] = np.sqrt(np.sum((coor[j] - coor[k]) ** 2))


with the result being an upper triangular matrix of the distances

In [7]: dist
Out[7]: array([[ 0. , 3.31662479, 7.28010989, 9.2736185 ],
[ 0. , 0. , 10.48808848, 10.81665383],
[ 0. , 0. , 0. , 9.43398113],
[ 0. , 0. , 0. , 0. ]])


Leveraging NumPy, I can avoid looping using

In [8]: dist = np.sqrt(((coor[:, None, :] - coor) ** 2).sum(-1))


but the result is the entire matrix

In [9]: dist
Out[9]: array([[ 0. , 3.31662479, 7.28010989, 9.2736185 ],
[ 3.31662479, 0. , 10.48808848, 10.81665383],
[ 7.28010989, 10.48808848, 0. , 9.43398113],
[ 9.2736185 , 10.81665383, 9.43398113, 0. ]])


This one line version takes roughly half the time when I use 2048 coordinates (4 s instead of 10 s) but this is doing twice as many calculations as it needs in order to get the symmetric matrix. Is there a way to adjust the one line command to only get the triangular matrix (and the additional 2x speedup, i.e. 2 s)?

Answer

We can use Scipy's pdist to get those distances. So, we just need to initialize the output array and then set the upper triangular places with those distances, like so -

from scipy.spatial.distance import pdist

n_coor = len(coor)
dist = np.zeros((n_coor, n_coor))
row,col = np.triu_indices(n_coor,1)
dist[row,col] = pdist(coor)

Alternatively, we can use boolean-indexing to assign values and thus replace the last two lines, like so -

dist[np.arange(n_coor)[:,None] < np.arange(n_coor)] = pdist(coor)

Runtime test

Functions -

def subscripted_indexing(coor):
    n_coor = len(coor)
    dist = np.zeros((n_coor, n_coor))
    row,col = np.triu_indices(n_coor,1)
    dist[row,col] = pdist(coor)
    return dist

def boolean_indexing(coor):
    n_coor = len(coor)
    dist = np.zeros((n_coor, n_coor))
    r = np.arange(n_coor)
    dist[r[:,None] < r] = pdist(coor)
    return dist

Timings -

In [110]: # Setup input array
     ...: coor = np.random.randint(0,10, (2048,3))

In [111]: %timeit subscripted_indexing(coor)
10 loops, best of 3: 91.4 ms per loop

In [112]: %timeit boolean_indexing(coor)
10 loops, best of 3: 47.8 ms per loop